February 02, 2020

18% of GitHub Projects Have CI

Introduction

In this post, I’ll share my analysis of the adoption of continuous integration (CI) in public GitHub repositories, based on usage of the following CI vendors:

  • TravisCI
  • CircleCI
  • Jenkins
  • GitHub Actions
  • Azure Pipelines
  • TeamCity
  • Google Cloud Build

I’m the first to admit that this is not a comprehensive list of CI vendors.

It’s also obvious that using a CI vendor for executing your builds is not a requirement for having CI.

That being said, we can dig into public data provided by GitHub to get a lower-bound estimate of CI adoption.

Methodology

GitHub has provided a public dataset of code that can be queried via Google’s BigQuery database service.

I followed Google’s “Codelab” for querying this dataset using SQL, running a basic query to identify repositories in their test dataset.

To identify whether a repository uses a given CI vendor, we need a rule for each vendor that recognizes repositories configured for it.

For example, modern versions of Jenkins encourage you to include a Jenkinsfile in the root of your repository.

If this file is present then Jenkins will process its contents as a manifest for how to build your software.

This type of file-based configuration is not exclusive to Jenkins, and it allows us to count repositories that are using file-based manifests for configuring CI.

Here is a list of rules for detecting if a repository is configured for CI with the given vendor:

CI Vendor            File pattern
TravisCI             .travis.yml
CircleCI             .circleci/config.yml (or circle.yml)
Jenkins              Jenkinsfile
GitHub Actions       .github/workflows
Azure Pipelines      azure-pipelines.yml
TeamCity             .teamcity/
Google Cloud Build   cloudbuild.yaml
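
The rules in the table can be read as a small lookup: most vendors are identified by one exact path at the repository root, while GitHub Actions and TeamCity are identified by a directory prefix. Below is a minimal Python sketch of that idea. The function name `detect_ci_vendor` and the two dictionaries are illustrative names, not part of any vendor's tooling, and the `circle.yml` entry mirrors what the SQL query in this post also accepts.

```python
from typing import Optional

# Exact root-level paths, one per rule (illustrative mapping).
EXACT_PATHS = {
    ".travis.yml": "TravisCI",
    ".circleci/config.yml": "CircleCI",
    "circle.yml": "CircleCI",
    "Jenkinsfile": "Jenkins",
    "azure-pipelines.yml": "Azure Pipelines",
    "cloudbuild.yaml": "Google Cloud Build",
}

# Vendors recognized by a path prefix rather than a single file.
PREFIXES = {
    ".github/workflows": "GitHub Actions",
    ".teamcity/": "TeamCity",
}


def detect_ci_vendor(path: str) -> Optional[str]:
    """Return the CI vendor a repository-relative path indicates, or None."""
    vendor = EXACT_PATHS.get(path)
    if vendor is not None:
        return vendor
    for prefix, name in PREFIXES.items():
        if path.startswith(prefix):
            return name
    return None


print(detect_ci_vendor(".travis.yml"))               # TravisCI
print(detect_ci_vendor(".github/workflows/ci.yml"))  # GitHub Actions
print(detect_ci_vendor("src/.travis.yml"))           # None (not at the root)
```

Matching on the full repository-relative path means a config file buried in a subdirectory is not counted, which keeps the rule close to how these vendors typically discover their configuration.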

Here is the SQL query I used to get a count of repositories that had each of these files:

-- Label each file path with the CI vendor it indicates.
-- Paths that match no rule fall through to NULL and form their own group.
SELECT
  CASE
    WHEN path = '.travis.yml' THEN 'TravisCI'
    WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
    WHEN path = 'Jenkinsfile' THEN 'Jenkins'
    WHEN path LIKE '.github/workflows%' THEN 'GitHub Actions'
    WHEN path = 'azure-pipelines.yml' THEN 'Azure Pipelines'
    WHEN path LIKE '.teamcity/%' THEN 'TeamCity'
    WHEN path = 'cloudbuild.yaml' THEN 'Google Cloud Build'
  END AS ci_vendor,
  -- Count each repository once per vendor, however many files match.
  COUNT(DISTINCT repo_name) AS num_repos
FROM `bigquery-public-data.github_repos.files`
GROUP BY 1
ORDER BY 2 DESC

BigQuery link
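
To see how a query of this shape treats a repository that uses more than one vendor, here is a rough sketch that runs it against a handful of made-up rows, using Python's built-in sqlite3 module in place of BigQuery. The repository names and rows are invented for illustration, and only three of the seven rules are included to keep it short. SQLite's LIKE also differs from BigQuery's in small ways (it is case-insensitive for ASCII by default), so treat this as an illustration of the grouping logic rather than a reproduction of the real query.

```python
import sqlite3

# Hypothetical stand-in for the `files` table: (repo_name, path).
ROWS = [
    ("alice/app", ".travis.yml"),
    ("alice/app", ".circleci/config.yml"),
    ("alice/app", "README.md"),
    ("bob/lib", ".travis.yml"),
    ("carol/site", ".github/workflows/ci.yml"),
    ("dave/notes", "README.md"),
]

# Same CASE structure as the post's query, reduced to three vendors.
CLASSIFIED = """
SELECT repo_name,
       CASE
         WHEN path = '.travis.yml' THEN 'TravisCI'
         WHEN path = '.circleci/config.yml' OR path = 'circle.yml' THEN 'CircleCI'
         WHEN path LIKE '.github/workflows%' THEN 'GitHub Actions'
       END AS ci_vendor
FROM files
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (repo_name TEXT, path TEXT)")
conn.executemany("INSERT INTO files VALUES (?, ?)", ROWS)

# Per-vendor counts: a repo with two vendors lands in two groups.
per_vendor = conn.execute(
    f"SELECT ci_vendor, COUNT(DISTINCT repo_name) FROM ({CLASSIFIED}) "
    "WHERE ci_vendor IS NOT NULL GROUP BY ci_vendor ORDER BY 2 DESC, 1"
).fetchall()

# Repos with at least one CI file, each counted once.
any_ci = conn.execute(
    f"SELECT COUNT(DISTINCT repo_name) FROM ({CLASSIFIED}) "
    "WHERE ci_vendor IS NOT NULL"
).fetchone()[0]

print(per_vendor)  # [('TravisCI', 2), ('CircleCI', 1), ('GitHub Actions', 1)]
print(any_ci)      # 3
```

In this toy data the per-vendor counts add up to 4 while only 3 distinct repositories have any CI file, because alice/app is counted under both TravisCI and CircleCI.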

Results

CI Vendor            Number of Repos   Percentage of Repos
Total                        3345134               100.00%
TravisCI                      565369                16.90%
CircleCI                       28608                 0.86%
GitHub Actions                  5532                 0.17%
Jenkins                         5389                 0.16%
Azure Pipelines                 1829                 0.05%
Google Cloud Build               253                 0.01%
TeamCity                          72                 0.00%

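Summing the rows gives 18.1% of repositories in this dataset set up with at least one of these CI vendors. Note that the query counts each vendor separately, so a repository configured for more than one vendor is counted once per vendor, and the share of distinct repositories may be somewhat lower.

TravisCI seems to dominate the market: a .travis.yml file is present in 16.9% of repos.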
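
The percentages above are simple ratios of the reported counts. As a quick sanity check, this Python snippet recomputes them from the numbers in the results; the variable names are illustrative.

```python
# Counts copied from the results reported in this post.
TOTAL_REPOS = 3_345_134
COUNTS = {
    "TravisCI": 565_369,
    "CircleCI": 28_608,
    "GitHub Actions": 5_532,
    "Jenkins": 5_389,
    "Azure Pipelines": 1_829,
    "Google Cloud Build": 253,
    "TeamCity": 72,
}

# Share of all repositories, per vendor, in percent.
shares = {vendor: 100 * n / TOTAL_REPOS for vendor, n in COUNTS.items()}

# Sum of the per-vendor counts and its share of all repositories.
summed_count = sum(COUNTS.values())
summed_share = 100 * summed_count / TOTAL_REPOS

for vendor, n in COUNTS.items():
    print(f"{vendor:<20}{n:>8}{shares[vendor]:>8.2f}%")
print(f"{'Sum':<20}{summed_count:>8}{summed_share:>8.1f}%")
```

The same dictionary makes it easy to look at other ratios, such as each vendor's share of the CI-configured counts rather than of all repositories.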

Here are some data visualizations to help illustrate the results:

[Two figures visualizing the results table]

Discussion

With only 18% adoption, that leaves 82% of these public repositories not using popular CI vendors.

This is a huge opportunity to make CI a default and easier to adopt.

Areas of future research:

  • relationship between CI adoption and popularity of repository
  • analysis of a repository’s dependencies & their use of CI
  • analysis of CI adoption by programming language
  • when in the repository’s timeline was CI introduced (at the beginning? later?)

One area I am most curious about is automatically profiling a project’s CI runtimes and the factors that introduce flakiness into the build.

These two characteristics of software reflect:

  1. the mental health of software engineers working on it
  2. the ability to make changes quickly and confidently to it

Please do email me with comments/suggestions, etc. at max.mautner@gmail.com