In this post, I’ll share my analysis of the adoption of continuous integration (CI) in public GitHub repositories, based on usage of the following CI vendors: TravisCI, CircleCI, Jenkins, GitHub Actions, Azure Pipelines, TeamCity, and Google Cloud Build.
I’m the first to admit that this is not a comprehensive list of CI vendors.
It’s also obvious that using a CI vendor for executing your builds is not a requirement for having CI.
That being said, we can dig into public data provided by GitHub to get a lower-bound estimate.
GitHub has provided a public dataset of code that can be queried via Google’s BigQuery database service.
I followed Google’s “Codelab” for querying this dataset using SQL, running a basic query to identify repositories in their test dataset.
To identify whether a repository uses a given CI vendor, we need a rule for each vendor that detects its configuration in the repository.
For example, modern versions of Jenkins encourage you to include a Jenkinsfile in the root of your repository.
If this file is present then Jenkins will process its contents as a manifest for how to build your software.
This type of file-based configuration is not exclusive to Jenkins, and it allows us to count repositories that are using file-based manifests for configuring CI.
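Here is a list of rules for detecting if a repository is configured for CI with the given vendor:
- TravisCI: a .travis.yml file in the repository root
- CircleCI: a .circleci/config.yml or circle.yml file
- Jenkins: a Jenkinsfile in the repository root
- GitHub Actions: any path beginning with .github/workflows
- Azure Pipelines: an azure-pipelines.yml file in the repository root
- TeamCity: any path under .teamcity/
- Google Cloud Build: a cloudbuild.yaml file in the repository root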
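To make the rules concrete, here is a minimal Python sketch that mirrors the path checks used in the SQL query in this post and counts distinct repositories per vendor. It is an illustration only, not the code behind the results: the function names `classify_ci_vendor` and `count_repos_by_vendor` and the sample rows are made up for this example.

```python
from collections import defaultdict


def classify_ci_vendor(path):
    """Return the CI vendor implied by a file path, or None if no rule matches.

    Hypothetical helper: it restates the path rules from the SQL query.
    """
    if path == ".travis.yml":
        return "TravisCI"
    if path in (".circleci/config.yml", "circle.yml"):
        return "CircleCI"
    if path == "Jenkinsfile":
        return "Jenkins"
    if path.startswith(".github/workflows"):
        return "Github Actions"
    if path == "azure-pipelines.yml":
        return "Azure Pipelines"
    if path.startswith(".teamcity/"):
        return "TeamCity"
    if path == "cloudbuild.yaml":
        return "Google Cloud Build"
    return None


def count_repos_by_vendor(files):
    """Count distinct repositories per vendor from (repo_name, path) rows."""
    repos = defaultdict(set)
    for repo_name, path in files:
        vendor = classify_ci_vendor(path)
        if vendor is not None:
            repos[vendor].add(repo_name)
    return {vendor: len(names) for vendor, names in repos.items()}


# Made-up sample rows, standing in for rows of the public files table.
sample_files = [
    ("alice/app", ".travis.yml"),
    ("alice/app", "src/main.py"),
    ("bob/lib", ".github/workflows/test.yml"),
    ("bob/lib", ".github/workflows/release.yml"),
    ("bob/lib", ".travis.yml"),
    ("carol/site", "docs/.travis.yml"),  # not in the repo root, so no match
]

counts = count_repos_by_vendor(sample_files)
print(counts)
```

Note that a repository with several workflow files is counted once for that vendor, while a repository configured for two vendors is counted once under each.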
Here is the SQL query I used to get a count of repositories that had each of these files:
SELECT
  -- Map each known CI config path to its vendor; other paths yield NULL.
  CASE
    WHEN path = '.travis.yml' THEN 'TravisCI'
    WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
    WHEN path = 'Jenkinsfile' THEN 'Jenkins'
    WHEN path LIKE '.github/workflows%' THEN 'Github Actions'
    WHEN path = 'azure-pipelines.yml' THEN 'Azure Pipelines'
    WHEN path LIKE '.teamcity/%' THEN 'TeamCity'
    WHEN path = 'cloudbuild.yaml' THEN 'Google Cloud Build'
  END AS ci_vendor,
  -- Count each repository once per vendor, however many files match.
  COUNT(DISTINCT repo_name) AS num_repos
FROM `bigquery-public-data.github_repos.files`
GROUP BY 1
ORDER BY 2 DESC
CI Vendor | Number of Repos | Percentage of Repos |
Total | 3345134 | 100.00% |
TravisCI | 565369 | 16.90% |
CircleCI | 28608 | 0.86% |
Github Actions | 5532 | 0.17% |
Jenkins | 5389 | 0.16% |
Azure Pipelines | 1829 | 0.05% |
Google Cloud Build | 253 | 0.01% |
TeamCity | 72 | 0.00% |
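That’s a grand total of 18.1% of repositories in this dataset having been set up with any of these CI vendors. A repository configured for more than one vendor is counted once per vendor, so the share of distinct repositories may be slightly lower.
The market seems to be dominated by TravisCI, with a .travis.yml file present in 16.9% of repos.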
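As a quick check on the arithmetic, the percentages can be recomputed from the counts reported in the table. This is a small Python sketch using only those published numbers; the variable names are arbitrary.

```python
# Counts copied from the results table in this post.
TOTAL_REPOS = 3345134
repo_counts = {
    "TravisCI": 565369,
    "CircleCI": 28608,
    "Github Actions": 5532,
    "Jenkins": 5389,
    "Azure Pipelines": 1829,
    "Google Cloud Build": 253,
    "TeamCity": 72,
}

# Per-vendor share of all repositories, as a percentage.
shares = {vendor: 100 * n / TOTAL_REPOS for vendor, n in repo_counts.items()}

# Summing per-vendor counts double-counts repos that use several vendors,
# so this is an upper bound on the share of distinct repos with CI config.
combined_count = sum(repo_counts.values())
combined_share = 100 * combined_count / TOTAL_REPOS

for vendor, n in sorted(repo_counts.items(), key=lambda item: -item[1]):
    print(f"{vendor:<20}{n:>8}{shares[vendor]:>8.2f}%")
print(f"{'Combined':<20}{combined_count:>8}{combined_share:>8.2f}%")
```

The combined figure rounds to the 18.1% quoted above, and the remainder is the roughly 82% of repositories that match none of the rules.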
Here are some data visualizations to help illustrate the results:
With only 18% adoption, that leaves 82% of these public repositories not using popular CI vendors.
This is a huge opportunity to make CI a default and easier to adopt.
Areas of future research:
One area I am most curious about is automatically profiling a software’s CI runtimes and factors that introduce flakiness to the build.
These two characteristics of software reflect:
Please do email me with comments/suggestions, etc. at max.mautner@gmail.com