Hi Experts,
I have a Flink cluster (per-job mode) running on Kubernetes. The job is configured with the restart strategy

restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s

So after 3 retries the job is marked as FAILED and the pods stop running. However, Kubernetes then restarts the job again because the number of available replicas does not match the desired one.

What are the suggestions for such a scenario? How should I configure a Flink job running on k8s?

Thanks a lot!
Eleanore
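For reference, a minimal flink-conf.yaml sketch of this setup; the "restart-strategy: fixed-delay" selector line is an assumption added for completeness and is not quoted in the question above:

# flink-conf.yaml (sketch)
# Assumed: the fixed-delay strategy also has to be selected explicitly.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s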
Hi Eleanore,
How are you deploying Flink exactly? Are you using the application mode with native K8s support to deploy a cluster [1], or are you manually deploying a per-job mode cluster [2]?

I believe the problem might be that we terminate the Flink process with a non-zero exit code if the job reaches ApplicationStatus.FAILED [3].

cc Yang Wang: have you observed similar behavior when running Flink in per-job mode on K8s?

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html#flink-kubernetes-application
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html#job-cluster-resource-definitions
[3] https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L32
Hi Till,
Thanks for the reply! I manually deploy in per-job mode [1] and I am using Flink 1.8.2. Specifically, I build a custom Docker image into which I copy the app jar (not an uber jar) and all its dependencies under /flink/lib.

So my question is more like: in this case, if the job is marked as FAILED, which causes k8s to restart the pod, this does not seem to help at all. What are the suggestions for such a scenario?

Thanks a lot!
Eleanore

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html#flink-job-cluster-on-kubernetes
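As a rough illustration of the image layout described above, a Dockerfile sketch; the base image tag, artifact names, and target path are placeholders, not taken from this thread:

# Sketch only; jar names and paths are placeholders.
FROM flink:1.8.2
# The application jar plus its (non-bundled) dependencies go onto Flink's classpath.
COPY target/my-job.jar /opt/flink/lib/
COPY target/dependency/ /opt/flink/lib/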
Hi Eleanore,
I think you are using the K8s resource "Job" to deploy the jobmanager. Please set .spec.template.spec.restartPolicy = "Never" and .spec.backoffLimit = 0. Refer to [1] for more information.

Then, when the jobmanager fails for any reason, the K8s Job will be marked failed, and K8s will not restart it again.

[1] https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup

Best,
Yang
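For concreteness, a minimal sketch of such a Job manifest; the image name and entrypoint arguments are placeholders and depend on the Flink image and version in use:

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  # Do not retry the pod after a failure.
  backoffLimit: 0
  template:
    spec:
      # Do not restart the container in place.
      restartPolicy: Never
      containers:
        - name: jobmanager
          # Placeholder image and arguments.
          image: my-flink-job:1.8.2
          args: ["job-cluster", "--job-classname", "com.example.MyJob"]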
@Yang Wang <[hidden email]> I believe that we should rethink the exit codes of Flink. In general you want K8s to restart a failed Flink process. Hence, an application which terminates in state FAILED should not return a non-zero exit code, because FAILED is a valid termination state.

Cheers,
Till
@Till Rohrmann <[hidden email]> In native mode, when a Flink application terminates in FAILED state, all the resources will be cleaned up.

However, in standalone mode, I agree with you that we need to rethink the exit code of Flink. When a job exhausts the restart strategy, we should terminate the pod and not restart it again. After googling, it seems that we cannot specify the restartPolicy based on the exit code [1]. So maybe we need to return a zero exit code to avoid restarting by K8s.

[1] https://stackoverflow.com/questions/48797297/is-it-possible-to-define-restartpolicy-based-on-container-exit-code

Best,
Yang
Hi Yang & Till,
Thanks for your prompt reply!

Yang, regarding your question: I am actually not using a k8s Job. I put my app.jar and its dependencies under Flink's lib directory, and I have 1 k8s Deployment for the job manager, 1 k8s Deployment for the task managers, and 1 k8s Service for the job manager.

As you mentioned above, if the Flink job is marked as FAILED, it causes the job manager pod to be restarted, which is not the ideal behavior.

Do you suggest that I change the deployment strategy from a k8s Deployment to a k8s Job? That way, if the Flink program exits with a non-zero code (e.g. after exhausting the configured number of restarts), the pod can be marked as complete and the job is not restarted again?

Thanks a lot!
Eleanore
Hi Eleanore,
Yes, I suggest using a Job instead of a Deployment. It can run the jobmanager once and finish after a successful/failed completion.

However, using a Job still does not solve your problem completely. Just as Till said, when a job exhausts the restart strategy, the jobmanager pod will terminate with a non-zero exit code, and K8s will restart it again. Even though we could set the restartPolicy and backoffLimit, this is not a clean and correct way to go. We should terminate the jobmanager process with a zero exit code in such a situation.

@Till Rohrmann <[hidden email]> I just have one concern: is this a special case for the K8s deployment? For standalone/YARN/Mesos, it seems that terminating with a non-zero exit code is harmless.

Best,
Yang
Yes, for the other deployments it is not a problem. One reason why people preferred non-zero exit codes in case of FAILED jobs is that they are easier to monitor than having to look at the actual job result. Moreover, in the YARN web UI the application shows as failed, if I am not mistaken. However, from the framework's perspective a FAILED job does not mean that Flink itself has failed, and hence the return code could still be 0, in my opinion.

Cheers,
Till
Actually, the application status shown in the YARN web UI is not determined by the jobmanager process exit code. Instead, we use "resourceManagerClient.unregisterApplicationMaster" to control the final status of the YARN application. So even if the jobmanager exits with a zero code, the application can still show a failed status in the YARN web UI.

I have created a ticket to track this improvement [1].

[1] https://issues.apache.org/jira/browse/FLINK-18828

Best,
Yang
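For illustration only, a hedged sketch (not Flink's actual code) of how the final status reported to YARN via the Hadoop AMRMClient API is decoupled from the JVM exit code; the class, method name, and message string below are placeholders:

import java.io.IOException;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.exceptions.YarnException;

// Sketch: the status shown in the YARN web UI comes from this call,
// not from the exit code of the jobmanager process.
final class FinalStatusReporter {
    static void report(AMRMClient<AMRMClient.ContainerRequest> rmClient,
                       boolean jobFailed) throws YarnException, IOException {
        FinalApplicationStatus status =
                jobFailed ? FinalApplicationStatus.FAILED : FinalApplicationStatus.SUCCEEDED;
        rmClient.unregisterApplicationMaster(status, "job finished", null);
        // The JVM can still exit with code 0 afterwards.
    }
}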
You are right Yang Wang.
Thanks for creating this issue.

Cheers,
Till
Hi Yang and Till,
Thanks a lot for the help! I have a similar concern to the one Till mentioned: if we do not fail the Flink pods when the restart strategy is exhausted, it might be hard to monitor such failures. Today I get alerts if the k8s pods are restarted or in a crash loop, but if that is no longer the case, how can we handle monitoring? In production I have hundreds of small Flink jobs running (2-8 TM pods each) doing stateless processing, and it is really hard for us to expose an ingress for each JM REST endpoint in order to periodically query the job status of every Flink job.

Thanks a lot!
Eleanore
Hi Eleanore,
From my experience, collecting the Flink metrics in Prometheus via the Prometheus metrics reporter [1] is the better approach. It also makes it easier to configure alerts. You could use "fullRestarts" or "numRestarts" to monitor job restarts; more availability metrics can be found here [2].

[1] https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
[2] https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#availability

Best,
Yang
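A minimal flink-conf.yaml sketch for enabling the Prometheus reporter described in [1]; the port is just an example value:

# Sketch: expose Flink metrics to Prometheus (port is an example).
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249

An alerting rule on "fullRestarts" / "numRestarts" can then replace the pod-restart alerts mentioned above.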
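A minimal flink-conf.yaml sketch of the reporter setup Yang describes, assuming the flink-metrics-prometheus jar from opt/ has been copied into /flink/lib of the image; the reporter name "prom" and the port range are placeholders:

metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9249-9260

Each JobManager and TaskManager then serves its metrics on the first free port in that range, which Prometheus can scrape, for example via pod annotations or a ServiceMonitor.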
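Once those metrics are scraped, the restart counter can drive the alert instead of pod restarts. A possible Prometheus alerting rule, assuming the reporter above and the default scope formats (the exported metric name can differ if metrics.scope.* is customized):

groups:
- name: flink-jobs
  rules:
  - alert: FlinkJobRestarting
    expr: increase(flink_jobmanager_job_fullRestarts[5m]) > 0
    for: 1m
    labels:
      severity: warning

This fires whenever a job has restarted in the last five minutes, which also covers the case where the fixed-delay strategy is about to be exhausted.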
Hi Yang,
Thanks a lot for the information!
Eleanore

On Thu, Aug 6, 2020 at 4:20 AM Yang Wang <[hidden email]> wrote:

> Hi Eleanore,
>
> From my experience, collecting the Flink metrics into Prometheus via the
> metrics reporter[1] is a better approach. It also makes it easier to
> configure alerts. Maybe you could use "fullRestarts" or "numRestarts" to
> monitor the job restarting. More availability metrics can be found here[2].
>
> [1]. https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#prometheus-orgapacheflinkmetricsprometheusprometheusreporter
> [2]. https://ci.apache.org/projects/flink/flink-docs-master/monitoring/metrics.html#availability
>
> Best,
> Yang
|
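A sketch of the Job-based JobManager deployment Yang suggests earlier in the thread, in place of a Deployment, assuming the standalone per-job image from the 1.8 docs; the image name, job class name, and ports are placeholders that would need to match the actual build:

apiVersion: batch/v1
kind: Job
metadata:
  name: flink-jobmanager
spec:
  backoffLimit: 0
  template:
    metadata:
      labels:
        app: flink
        component: jobmanager
    spec:
      restartPolicy: Never
      containers:
      - name: jobmanager
        image: my-flink-job:1.8.2
        args: ["job-cluster", "--job-classname", "com.example.MyStreamingJob",
               "-Djobmanager.rpc.address=flink-jobmanager"]
        ports:
        - containerPort: 6123
          name: rpc
        - containerPort: 8081
          name: ui

With restartPolicy: Never and backoffLimit: 0, Kubernetes does not recreate the pod once the restart strategy is exhausted; until the exit-code change tracked in FLINK-18828 lands, the K8s Job will then show as failed rather than completed.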