
ci-kubernetes-e2e-gci-gce-reboot-release-1.5: broken test run

Mar 14, 2026
Confidence Score: 55%

Problem

https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-reboot-release-1.5/684/

Multiple broken tests:

Failed: [k8s.io] Reboot [Disruptive] [Feature:Reboot] each node by dropping all inbound packets for a while and ensure they function afterwards {Kubernetes e2e suite}
[code block]
Issues about this test specifically: #33405

Failed: [k8s.io] Reboot [Disruptive] [Feature:Reboot] each node by dropping all outbound packets for a while and ensure they function afterwards {Kubernetes e2e suite}
[code block]
Issues about this test specifically: #33703 #36230

Failed: [k8s.io] Reboot [Disruptive] [Feature:Reboot] each node by ordering unclean reboot and ensure they function upon restart {Kubernetes e2e suite}
[code block]
Issues about this test specifically: #33882 #35316

Failed: [k8s.io] Reboot [Disruptive] [Feature:Reboot] each node by triggering kernel panic and ensure they function upon restart {Kubernetes e2e suite}
[code block]
Issues about this test specifically: #34123 #35398


1 Fix


Fix Flaky Reboot Tests in Kubernetes E2E Suite

Medium Risk

The reboot tests are likely failing because the inbound and outbound packet-drop simulations do not accurately reflect real-world network conditions. The test environment may also handle unclean reboots and kernel panics improperly, which would explain why results differ from run to run.


  1. Review and Update Test Cases

    Investigate the existing test cases related to reboot functionality. Ensure that they are designed to handle various network conditions and system states. Update the tests to include more robust error handling and logging.

    go
    import "testing"

    func TestRebootWithPacketDrop(t *testing.T) {
        // TODO: implement the packet drop simulation.
        // TODO: log network state before and after the reboot.
        t.Skip("packet drop simulation not implemented yet")
    }
  2. Enhance Network Simulation

    Modify the network simulation logic to better mimic real-world scenarios. This includes adjusting the duration and conditions under which packets are dropped, ensuring that the tests can handle transient network failures gracefully.

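    bash
    # Run as root on the node under test.
    # Insert at position 1 so the DROP rule takes precedence over existing ACCEPT rules.
    iptables -I INPUT 1 -j DROP
    sleep 30
    # Remove the rule so inbound traffic resumes.
    iptables -D INPUT -j DROP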
  3. Implement Cleanup Procedures

    Add cleanup procedures to the tests to ensure that the system is in a known good state before each test run. This includes resetting network configurations and ensuring all nodes are healthy.

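    bash
    # Force-delete all pods in the current namespace.
    kubectl delete pods --all --grace-period=0 --force
    # Pass bare node names to drain (no header or extra columns); drained nodes stay cordoned until `kubectl uncordon`.
    kubectl get nodes --no-headers -o custom-columns=NAME:.metadata.name | xargs -I {} kubectl drain {} --ignore-daemonsets --delete-local-data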
  4. Run Tests in Isolated Environment

    Execute the tests in an isolated environment to reduce interference from other processes. This can be achieved by using dedicated test clusters or namespaces specifically configured for e2e tests.

    bash
    kubectl create namespace e2e-test
    kubectl apply -f test-deployment.yaml -n e2e-test
  5. Monitor and Log Test Results

    Implement comprehensive logging for each test case to capture detailed information about failures. This will help in diagnosing issues and improving the tests over time.

    go
    log.Printf("Test %s failed at %s", testName, time.Now())

Validation

To confirm the fix worked, rerun the four affected reboot tests several times and verify that every run passes. Check the logs from each run for errors or anomalies, even when the test itself reports success.
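
One way to make "pass consistently" concrete is to tally the outcome of each repeated run per test and label any test that both passed and failed as flaky. The Go sketch below assumes the run results have already been collected. The `runResult` and `tally` types, the function names, and the short test names in `main` are all hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// runResult records the outcome of one run of one test (hypothetical shape).
type runResult struct {
	Test   string
	Passed bool
}

// tally counts passes and failures for a single test.
type tally struct {
	Passes, Failures int
}

// tallyRuns groups results by test name and counts passes and failures.
func tallyRuns(results []runResult) map[string]tally {
	out := make(map[string]tally)
	for _, r := range results {
		t := out[r.Test]
		if r.Passed {
			t.Passes++
		} else {
			t.Failures++
		}
		out[r.Test] = t
	}
	return out
}

// classify labels a test: "stable" if no failures were recorded,
// "broken" if it never passed, and "flaky" if it did both.
func classify(t tally) string {
	switch {
	case t.Failures == 0:
		return "stable"
	case t.Passes == 0:
		return "broken"
	default:
		return "flaky"
	}
}

func main() {
	// Made-up sample data, not real results from the failing job.
	results := []runResult{
		{"inbound-drop", true}, {"inbound-drop", false}, {"inbound-drop", true},
		{"kernel-panic", true}, {"kernel-panic", true}, {"kernel-panic", true},
	}
	counts := tallyRuns(results)

	// Sort names so the report order is deterministic.
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		c := counts[name]
		fmt.Printf("%s: %d passed, %d failed (%s)\n", name, c.Passes, c.Failures, classify(c))
	}
}
```

A fix would count as validated only when every affected test is classified as stable over the chosen number of runs.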


Environment

Submitted by


Alex Chen

2450 rep

Tags

kubernetes, k8s, containers, priority/backlog, sig/node, area/test-infra, kind/flake