safe_sleep.sh rarely hangs indefinitely
Problem
Describe the bug
Very rarely, when the GitHub Actions runner updates itself, `safe_sleep.sh` hangs forever: [code block] I suspect this can happen when the machine runs in the cloud and is overloaded and/or overcommitted.
To Reproduce
Steps to reproduce the behavior:
1. Download an older runner version, for example 2.322.
2. Register and run the runner.
3. The runner updates itself. It may also pick up a task to complete in the meantime.
Expected behavior
The update should not hang indefinitely.
Runner Version and Platform
Runner 2.322
OS: Linux
1 Fix
Solution: safe_sleep.sh rarely hangs indefinitely
- 1
The bug in this "safe sleep" script is obvious from looking at it: if the process is not scheduled for the one-second interval in which the loop would return (due to `$SECONDS` having the correct value), then it simply spins forever. That can easily happen on a CI machine under extreme load. When this happens, it's pretty bad: it completely breaks a runner until manual intervention. On Zig's CI runner machines, we observed multiple of these processes which had been running for hundreds of hours, silently taking down two runner services for weeks.
- 2
I don't understand how we got here. Even ignoring the pretty clear bug, what makes this Bash script "safer" than calling into the POSIX standard `sleep` utility? It doesn't seem to solve any problem; meanwhile, it's less portable and needlessly eats CPU time by busy-waiting.
- 3
The sloppy coding which is evident here, as well as the inaction on core Actions bugs (in line with the decay in quality of almost every part of GitHub's product), is forcing the Zig project to strongly consider moving away from GitHub Actions entirely. With this bug, and many others (severe workflow scheduling issues resulting in dozens of timeouts; logs randomly becoming inaccessible; random job cancellations without details; perpetually "pending" jobs), we can no longer trust that Actions can be used to implement reliable CI infrastructure. I personally would seriously encourage other proje
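The first two points above can be illustrated with a small Bash sketch. It assumes, as the first point describes, that the loop only exits when `$SECONDS` is exactly equal to the requested duration. All function names here (`keeps_spinning_equality`, `keeps_spinning_ordered`, `wait_seconds`) are hypothetical and are not taken from the runner's source. The sketch contrasts the equality test with an ordered comparison, then shows a wrapper that prefers the standard `sleep` utility and only busy-waits as a fallback.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; names are hypothetical, not the runner's code.

# Loop condition as described in the issue: keep spinning until the elapsed
# second count is exactly equal to the target. If the exact second is missed,
# this condition stays true forever.
keeps_spinning_equality() {
    local elapsed=$1 target=$2
    [[ "$elapsed" != "$target" ]]
}

# Ordered comparison: keep spinning only while elapsed is below the target,
# so overshooting the target still ends the loop.
keeps_spinning_ordered() {
    local elapsed=$1 target=$2
    (( elapsed < target ))
}

# Wait for a whole number of seconds. Prefers the standard `sleep` utility;
# falls back to a busy-wait with the ordered comparison if `sleep` is missing.
wait_seconds() {
    local duration=$1
    if command -v sleep >/dev/null 2>&1; then
        sleep "$duration"
        return
    fi
    SECONDS=0   # assigning to SECONDS resets Bash's elapsed-time counter
    while keeps_spinning_ordered "$SECONDS" "$duration"; do
        :
    done
}
```

For example, `wait_seconds 5` blocks for about five seconds without spinning the CPU when `sleep` is available. With an elapsed value of 6 and a target of 5, `keeps_spinning_equality` still reports that the loop should continue, while `keeps_spinning_ordered` reports that it should stop.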
Validation
Resolved in actions/runner GitHub issue #3792. Community reactions: 363 upvotes.
Submitted by
Alex Chen
2450 rep