[k8s.io] [Feature:Example] [k8s.io] Spark should start spark master, driver and workers {Kubernetes e2e suite}
Problem
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/17028

Failed: [k8s.io] [Feature:Example] [k8s.io] Spark should start spark master, driver and workers 6.17s [code block]

This test has been consistently failing for a long time:
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/17027
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/17026
https://k8s-gubernator.appspot.com/build/kubernetes-jenkins/logs/kubernetes-e2e-gce-examples/17024
Error Output
```
error:
<exec.CodeExitError>: {
```
Fix Spark Master and Worker Startup in Kubernetes E2E Tests
The Spark master, driver, and worker pods fail to start because of misconfigured resource requests and limits, or because the service account used by Spark lacks sufficient permissions in the Kubernetes cluster. The logs indicate that the pods either cannot communicate properly or are being terminated due to resource constraints.
1. Update Spark Configuration
Modify the Spark configuration to ensure that the resource requests and limits are set appropriately for the Kubernetes environment. This will help prevent the pods from being terminated due to resource constraints.
```bash
spark-submit \
  --master k8s://https://<K8S_API_SERVER> \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<SPARK_IMAGE> \
  --conf spark.kubernetes.namespace=<NAMESPACE> \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.kubernetes.executor.limit.cores=2
```
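If the pods are instead created from example manifests rather than `spark-submit`, the same bounds belong in the container spec. A minimal sketch of a `resources` block (the container name, image placeholder, and values are illustrative, not taken from the test suite):

```yaml
# Sketch: resource bounds for a Spark worker container in a pod manifest.
# Values must be tuned to the cluster's actual capacity.
containers:
  - name: spark-worker          # illustrative name
    image: <SPARK_IMAGE>        # placeholder
    resources:
      requests:
        cpu: "1"
        memory: 512Mi
      limits:
        cpu: "2"
        memory: 1Gi
```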
2. Check Kubernetes Role and RoleBinding
Ensure that the service account used by Spark has the necessary permissions to create and manage pods in the specified namespace. Create or update the Role and RoleBinding if necessary.
```bash
kubectl create role spark-role \
  --verb=get,list,watch,create,update,delete \
  --resource=pods \
  --namespace=<NAMESPACE>
kubectl create rolebinding spark-role-binding \
  --role=spark-role \
  --serviceaccount=<NAMESPACE>:<SERVICE_ACCOUNT> \
  --namespace=<NAMESPACE>
```
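The same permissions can be kept in version control as a declarative manifest; a sketch of the equivalent Role and RoleBinding (`<NAMESPACE>` and `<SERVICE_ACCOUNT>` remain placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-role
  namespace: <NAMESPACE>
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: <NAMESPACE>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-role
subjects:
  - kind: ServiceAccount
    name: <SERVICE_ACCOUNT>
    namespace: <NAMESPACE>
```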
3. Increase Cluster Resources
If the cluster is running out of resources, increase the available CPU and memory to accommodate the Spark pods, either by using larger machine types or by adding nodes to the cluster.
```bash
gcloud container clusters resize <CLUSTER_NAME> \
  --node-pool <NODE_POOL_NAME> \
  --num-nodes <NEW_NODE_COUNT>
```
4. Run E2E Tests
After making the above changes, rerun the Kubernetes E2E tests to verify that the Spark master, driver, and worker pods start successfully without errors.
```bash
kubectl apply -f <SPARK_DEPLOYMENT_YAML>
kubectl get pods --namespace=<NAMESPACE>
```
Validation
Confirm that the Spark master, driver, and worker pods are running successfully by checking their status with `kubectl get pods --namespace=<NAMESPACE>`. Additionally, review the logs of the pods to ensure there are no errors related to resource allocation or permissions.
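The status check can be scripted so it fails loudly if any pod is in a phase other than `Running`. A minimal sketch, assuming the pod phases are piped in one per line; the `check_all_running` helper name is ours, not part of the test suite:

```shell
# check_all_running: exit 0 only when every line on stdin is exactly "Running".
# Feed it pod phases, e.g.:
#   kubectl get pods -n <NAMESPACE> \
#     -o jsonpath='{range .items[*]}{.status.phase}{"\n"}{end}' | check_all_running
# Caveat: empty input (no pods found) also passes; verify pod count separately.
check_all_running() {
  # grep -v selects lines that are NOT "Running"; -q exits 0 if any exist.
  # Negating that gives success only when no non-Running line was seen.
  ! grep -qv '^Running$'
}
```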
Submitted by Alex Chen