Very slow queuing with plenty of idle runners available

Mar 14, 2026
Confidence Score: 62%

Problem

Originally posted here: https://github.community/t/very-slow-queuing-behavior-when-idle-runners-are-available/127674

Describe the bug

When no other builds are running (all my runners are idle), there is very delayed behavior from GH actions before builds even start. This is especially noticeable with, for example, a 4^4 matrix (256 checks), even if you have 256 idle self-hosted runners. The UI shows “X queued checks” at a rate of around 4 per second (ie, “4 queued checks”, “8 queued checks”, etc), before it finally gets to 256 checks queued. It takes a full 1min 40sec before the first of my runners even receives a message and starts building. It takes 12-13min for the entire run to be marked as finished, even if each build does no work and completes in 1sec or less.

To Reproduce

1. Register (and run) 256 self-hosted runners
2. Run a workflow that uses a 4^4 matrix: [code block]
3. Observe that it takes a long time before the first build message is sent to a runner
4. Observe how, even after all checks are completed, it still takes many minutes (10?) for the entire workflow to be marked as finished

You can also observe similar behavior with fewer checks – eg, even just 16 runners and a 4^2 matrix. Even then, checks are queued before the first build will start – and there's a noticeable delay after the 16th check has finished before the whole workflow is marked as complete. I see an overall run time of 1min10sec – even though each worker has completed its build in less tha

1 Fix

Optimize GitHub Actions Queuing for Idle Runners

Medium Risk

The delay most likely originates in how GitHub Actions schedules jobs, not in the runners themselves: each job finishes in about a second, yet checks are queued at roughly 4 per second and the first runner receives work only after the queue has filled. This suggests that the service creates and queues matrix jobs in batches rather than dispatching each job to an idle runner as soon as it exists. A large matrix amplifies the effect, because more jobs must be queued before dispatch begins and more check results must be collected before the run is marked as finished.
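
As a rough check on the batching explanation, the figures from the report can be compared directly. The sketch below only does that arithmetic; the function name is made up for illustration, and the rate of 4 checks per second and the 1 min 40 s delay are the reporter's observations, not documented GitHub limits.

```python
def queue_fill_seconds(total_checks: int, checks_per_second: float) -> float:
    """Time needed to queue every check at a constant observed rate."""
    return total_checks / checks_per_second


matrix_jobs = 4 ** 4            # 4^4 matrix from the report
observed_rate = 4               # "around 4 per second" (reported, approximate)
first_dispatch_delay = 100      # 1 min 40 s before the first runner got a job

fill = queue_fill_seconds(matrix_jobs, observed_rate)
print(f"{matrix_jobs} checks take about {fill:.0f} s to queue")
print(f"remaining unexplained delay: about {first_dispatch_delay - fill:.0f} s")
```

Under these numbers, queuing alone would account for roughly two thirds of the wait before the first job starts, which is consistent with dispatch beginning only after the queue has filled.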

  1. Reduce Matrix Size

    Split the 4^4 matrix (256 jobs) into smaller matrices, for example several 4^2 matrices (16 jobs each) in separate jobs or workflows, so that fewer checks have to be queued before the first job is dispatched. The original report saw a delay even with a 4^2 matrix, so expect this to shorten the wait, not remove it.

    yaml
    jobs:
      build_small:
        runs-on: self-hosted
        strategy:
          matrix:
            # Two 4-value axes give 4^2 = 16 jobs; the axis names are placeholders.
            config: [1, 2, 3, 4]
            variant: [1, 2, 3, 4]
        steps:
          - run: echo 'Running small matrix job'
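  2. Avoid Throttling Job Dispatch

    No workflow setting makes GitHub dispatch jobs faster, but two settings can slow it down. A 'concurrency' group does the opposite of what is wanted here: it allows only one running job or workflow run per group and holds the others as pending, so do not put the matrix jobs in a shared group. 'max-parallel' is a key under 'strategy', not under 'concurrency'; by default GitHub runs as many matrix jobs in parallel as runner availability allows, so leave it unset or set it no lower than the number of matrix jobs.

    yaml
    jobs:
      build:
        runs-on: self-hosted
        strategy:
          # Optional: the default already maximizes parallel jobs.
          max-parallel: 256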
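  3. Check Runner Configuration

    Confirm that every self-hosted runner is running, connected, and not starved of CPU or memory. In the reported case each job completes in about a second, so runner resources are unlikely to explain a delay that occurs before any job is dispatched, but a stopped or disconnected runner would add to it. If the runner was installed as a service with the bundled svc.sh script, manage it from the runner's directory; the systemd unit name includes the organization or repository and the runner name, so a generic actions.runner.service unit usually does not exist.

    bash
    # Run from the runner's installation directory (svc.sh ships with the runner).
    sudo ./svc.sh status
    sudo ./svc.sh stop && sudo ./svc.sh start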
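  4. Monitor Runner Performance

    Monitor the self-hosted runners to rule out bottlenecks on the runner side. Prometheus can collect metrics such as CPU usage, memory usage, and job processing times, and Grafana can visualize them. The command below only starts a Prometheus server; each runner host still needs an exporter and a matching scrape configuration before any runner metrics appear.

    bash
    # Starts a Prometheus server on port 9090 with its default configuration.
    docker run -d --name=prometheus -p 9090:9090 prom/prometheus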
  5. Review GitHub Actions Settings

    Check the GitHub Actions settings for your repository to ensure that there are no restrictions or limits set on job concurrency or runner usage that could be affecting performance.

    none
    Navigate to Settings > Actions > General and review the settings.
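
Before changing any of the settings above, it can help to confirm that GitHub itself sees the runners as idle. The sketch below assumes a payload shaped like the response of the REST endpoint that lists a repository's self-hosted runners (a "runners" array whose entries carry "name", "status", and "busy"). The function name and the sample data are made up for illustration, and fetching the payload is left out.

```python
def idle_runners(payload: dict) -> list[str]:
    """Return names of runners that are online and not running a job."""
    return [
        runner["name"]
        for runner in payload.get("runners", [])
        if runner.get("status") == "online" and not runner.get("busy")
    ]


# Illustrative sample only; real data would come from the runners API.
sample_payload = {
    "total_count": 3,
    "runners": [
        {"name": "runner-01", "status": "online", "busy": False},
        {"name": "runner-02", "status": "online", "busy": True},
        {"name": "runner-03", "status": "offline", "busy": False},
    ],
}

idle = idle_runners(sample_payload)
print(f"{len(idle)} of {sample_payload['total_count']} runners idle: {idle}")
```

If the idle count is lower than expected, the delay may be partly a runner availability problem and not purely a scheduling one.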

Validation

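To confirm that the changes helped, trigger the workflow while all runners are idle and compare the result with the reported baseline for a 4^4 matrix: about 1 min 40 s before the first runner received a job, and 12-13 min before the run was marked as finished. Record both the time to the first job start and the time until the whole run is marked complete. Ideally, the first job starts within a few seconds of the trigger, and the total run time approaches the duration of the slowest job.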
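
One way to take these measurements is from job timestamps, not from watching the UI. The sketch below assumes job records that carry ISO 8601 "started_at" and "completed_at" fields, as the REST endpoint that lists jobs for a workflow run returns them, plus a trigger time supplied by the caller (for example the run's "created_at"). The function names and sample values are illustrative only.

```python
from datetime import datetime


def parse_ts(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2026-03-14T12:00:00Z'."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def dispatch_stats(triggered_at: str, jobs: list[dict]) -> dict:
    """Seconds from the trigger to the first start, last start, and last completion."""
    t0 = parse_ts(triggered_at)
    # Skip jobs that have not started or finished yet.
    starts = [parse_ts(j["started_at"]) for j in jobs if j.get("started_at")]
    ends = [parse_ts(j["completed_at"]) for j in jobs if j.get("completed_at")]
    return {
        "first_start_delay_s": (min(starts) - t0).total_seconds(),
        "last_start_delay_s": (max(starts) - t0).total_seconds(),
        "all_jobs_done_s": (max(ends) - t0).total_seconds(),
    }


# Illustrative sample only; real records would come from the jobs API.
sample_jobs = [
    {"started_at": "2026-03-14T12:01:40Z", "completed_at": "2026-03-14T12:01:41Z"},
    {"started_at": "2026-03-14T12:02:00Z", "completed_at": "2026-03-14T12:02:01Z"},
]

stats = dispatch_stats("2026-03-14T12:00:00Z", sample_jobs)
print(stats)
```

Comparing "all_jobs_done_s" with the time at which the run is finally marked as finished also isolates the trailing delay described in the report.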

Submitted by Alex Chen (2450 rep)

Tags

github-actions, ci-cd, runner, bug