18% of GitHub Projects Have CI
In this post, I’ll share my analysis of the adoption of continuous integration (CI) in public GitHub repositories, based on usage of the following CI vendors: TravisCI, CircleCI, Jenkins, GitHub Actions, Azure Pipelines, TeamCity, and Google Cloud Build.
I’m the first to admit that this is not a comprehensive list of CI vendors.
It’s also obvious that using a CI vendor for executing your builds is not a requirement for having CI.
That being said, we can dig into public data provided by GitHub to get a lower-bound estimate.
GitHub has provided a public dataset of code that can be queried via Google’s BigQuery database service.
I followed Google’s “Codelab” for querying this dataset using SQL, running a basic query to identify repositories in their test dataset.
To identify whether a repository uses a given CI vendor, we need a list of rules that map configuration files to the vendors that read them.
For example, in modern versions of Jenkins it’s encouraged to include a Jenkinsfile in the project root of your repository.
If this file is present then Jenkins will process its contents as a manifest for how to build your software.
This type of file-based configuration is not exclusive to Jenkins, and it allows us to count repositories that are using file-based manifests for configuring CI.
Here is a list of rules for detecting if a repository is configured for CI with the given vendor:
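|CI Vendor||File pattern|
|TravisCI||.travis.yml|
|CircleCI||.circleci/config.yml or circle.yml|
|Jenkins||Jenkinsfile|
|Github Actions||.github/workflows*|
|Azure Pipelines||azure-pipelines.yml|
|TeamCity||.teamcity/*|
|Google Cloud Build||cloudbuild.yaml|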
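To make the rules concrete, here is a minimal Python sketch of the same matching logic. The patterns are taken from the SQL query later in this post; the names `EXACT_RULES`, `PREFIX_RULES`, and `detect_ci_vendor` are my own, for illustration only. Note that the match is against the full path, so a `Jenkinsfile` in a subdirectory is not counted.

```python
from typing import Optional

# Exact-match file paths, relative to the repository root.
EXACT_RULES = {
    ".travis.yml": "TravisCI",
    ".circleci/config.yml": "CircleCI",
    "circle.yml": "CircleCI",
    "Jenkinsfile": "Jenkins",
    "azure-pipelines.yml": "Azure Pipelines",
    "cloudbuild.yaml": "Google Cloud Build",
}

# Prefix rules, for vendors that keep their configuration in a directory.
PREFIX_RULES = {
    ".github/workflows": "Github Actions",
    ".teamcity/": "TeamCity",
}


def detect_ci_vendor(path: str) -> Optional[str]:
    """Return the CI vendor implied by a file path, or None if no rule matches."""
    vendor = EXACT_RULES.get(path)
    if vendor is not None:
        return vendor
    for prefix, name in PREFIX_RULES.items():
        if path.startswith(prefix):
            return name
    return None


if __name__ == "__main__":
    for sample in [".travis.yml", ".github/workflows/ci.yml", "docs/Jenkinsfile"]:
        print(sample, "->", detect_ci_vendor(sample))
```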
Here is the SQL query I used to get a count of repositories that had each of these files:
SELECT
  CASE
    WHEN path = '.travis.yml' THEN 'TravisCI'
    WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
    WHEN path = 'Jenkinsfile' THEN 'Jenkins'
    WHEN path LIKE '.github/workflows%' THEN 'Github Actions'
    WHEN path = 'azure-pipelines.yml' THEN 'Azure Pipelines'
    WHEN path LIKE '.teamcity/%' THEN 'TeamCity'
    WHEN path = 'cloudbuild.yaml' THEN 'Google Cloud Build'
    -- Paths that match no rule fall into a NULL group.
  END AS ci_vendor,
  COUNT(DISTINCT repo_name) AS num_repos
FROM `bigquery-public-data.github_repos.files`
GROUP BY 1
ORDER BY 2 DESC
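To see how this query behaves without running it on BigQuery, here is a small sketch that runs a trimmed version of the same CASE expression against an in-memory SQLite table. Everything here is an assumption for illustration: the `files` table, the toy rows, and the use of SQLite as a stand-in (its `LIKE` is case-insensitive for ASCII by default, which may differ from BigQuery).

```python
import sqlite3

# Hypothetical toy rows standing in for the (repo_name, path) columns.
ROWS = [
    ("alice/app", ".travis.yml"),
    ("alice/app", "README.md"),
    ("bob/lib", ".travis.yml"),
    ("bob/lib", ".circleci/config.yml"),
    ("carol/site", ".github/workflows/ci.yml"),
    ("carol/site", ".github/workflows/release.yml"),
    ("dave/notes", "notes.txt"),
]

# A trimmed version of the CASE expression, limited to three vendors.
QUERY = """
SELECT
  CASE
    WHEN path = '.travis.yml' THEN 'TravisCI'
    WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
    WHEN path LIKE '.github/workflows%' THEN 'Github Actions'
  END AS ci_vendor,
  COUNT(DISTINCT repo_name) AS num_repos
FROM files
GROUP BY 1
ORDER BY 2 DESC
"""


def count_repos_by_vendor(rows):
    """Load the rows into an in-memory table and run the grouping query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (repo_name TEXT, path TEXT)")
    conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Maps vendor name (or None for unmatched paths) to a distinct repo count.
counts = dict(count_repos_by_vendor(ROWS))

if __name__ == "__main__":
    for vendor, num_repos in counts.items():
        print(vendor, num_repos)
```

Two things stand out on the toy data. A repository with several workflow files is counted once, thanks to `COUNT(DISTINCT repo_name)`, while a repository with both a `.travis.yml` and a CircleCI config is counted under both vendors. The `None` group counts repositories that have at least one non-matching file, so it is not the number of repositories without CI.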
|CI Vendor||Number of Repos||Percentage of Repos|
|Google Cloud Build|
That’s a grand total of 18.1% of repositories in this dataset that have been set up with at least one of these CI vendors.
The market seems to be dominated by TravisCI, with a .travis.yml file present in 16.9% of repos.
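One caveat when totalling per-vendor figures: a repository configured for two vendors appears in two rows, so summing the rows can overstate adoption compared with counting each repository once. The sketch below illustrates the difference on made-up data; the repository names and the helpers `adoption_share` and `summed_vendor_share` are hypothetical, and this is not necessarily how the 18.1% figure was computed.

```python
def adoption_share(vendors_by_repo):
    """Fraction of repos with at least one detected CI vendor (each repo counted once)."""
    if not vendors_by_repo:
        return 0.0
    with_ci = sum(1 for vendors in vendors_by_repo.values() if vendors)
    return with_ci / len(vendors_by_repo)


def summed_vendor_share(vendors_by_repo):
    """Sum of per-vendor shares; repos using several vendors are counted several times."""
    if not vendors_by_repo:
        return 0.0
    total_hits = sum(len(vendors) for vendors in vendors_by_repo.values())
    return total_hits / len(vendors_by_repo)


# Hypothetical toy data: repo name -> set of vendors detected in that repo.
VENDORS_BY_REPO = {
    "alice/app": {"TravisCI"},
    "bob/lib": {"TravisCI", "CircleCI"},
    "carol/site": set(),
    "dave/notes": set(),
    "erin/tool": set(),
}

if __name__ == "__main__":
    print("deduplicated share:", adoption_share(VENDORS_BY_REPO))
    print("summed share:", summed_vendor_share(VENDORS_BY_REPO))
```

On this toy data the two numbers differ because `bob/lib` uses two vendors. Either way the estimate remains a lower bound, since repositories can run CI without any of these files.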
Here are some data visualizations to help illustrate the results:
With only 18% adoption, that leaves 82% of these public repositories not using popular CI vendors.
This is a huge opportunity to make CI a default and easier to adopt.
Areas of future research:
- relationship between CI adoption and popularity of repository
- analysis of a repository’s dependencies & their use of CI
- analysis of CI adoption by programming language
- when in the repository’s timeline was CI introduced (at the beginning? later?)
The area I am most curious about is automatically profiling a project’s CI runtimes and the factors that introduce flakiness to the build.
These two characteristics of software reflect:
- the mental health of software engineers working on it
- the ability to make changes quickly and confidently to it
Please email me with comments or suggestions at email@example.com