Flink stops frequently

273339930
Dear all:
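      My Flink job crashes frequently, roughly once every 6 days. I have spent a long time investigating and cannot find the cause.
      Scenario: read from Kafka and write to HDFS, about 100 GB per hour.
      Environment: Kafka has Kerberos enabled; the cluster itself does not use Kerberos authentication.
      Symptoms:
       1. Checkpoints sometimes fail.

       2. Right after Flink was installed there was a problem: it could not consume from Kafka. After I added /etc/krb5.conf to every machine and configured the KDC domain name in each machine's hosts file, the problem was resolved and consumption worked. I did not expect the job to crash after 6 days. I suspect an authentication problem, but I am not sure.

       3. Below are the logs from just before the crash. Please help me analyze the cause.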
2020-03-06 01:11:27,027 INFO  org.apache.hadoop.hdfs.DFSClient                              - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12673/b1992c81-7821-4d68-9c0b-1302b24c42d0 retrying...
2020-03-06 01:11:33,457 INFO  org.apache.hadoop.hdfs.DFSClient                              - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12673/b1992c81-7821-4d68-9c0b-1302b24c42d0 retrying...
2020-03-06 01:11:56,869 INFO  org.apache.hadoop.hdfs.DFSClient                              - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12674/58c158bf-0583-4c14-a6d4-2873d1b0d9ac retrying...
2020-03-06 01:12:03,270 INFO  org.apache.hadoop.hdfs.DFSClient                              - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12674/58c158bf-0583-4c14-a6d4-2873d1b0d9ac retrying...
2020-03-06 01:12:18,283 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to cancel task Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0).
2020-03-06 01:12:18,283 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0) switched from RUNNING to CANCELING.
2020-03-06 01:12:18,283 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0).
2020-03-06 01:12:18,442 WARN  org.apache.kafka.common.security.kerberos.KerberosLogin       - [Principal=[hidden email]]: TGT renewal thread has been interrupted and will exit.
2020-03-06 01:12:18,443 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0) switched from CANCELING to CANCELED.
2020-03-06 01:12:18,443 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0).
2020-03-06 01:12:18,445 INFO  org.apache.flink.runtime.taskmanager.Task                     - Ensuring all FileSystem streams are closed for task Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0) [CANCELED]
2020-03-06 01:12:18,447 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Un-registering task and sending final execution state CANCELED to JobManager for task Source: Custom Source -> Sink: Unnamed 90484955ec6ef7ff8d705142bbf6b5e0.
2020-03-06 01:12:30,966 INFO  org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable      - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=1935}, allocationId: 13e1e5aaa46415eb80b67fda822519f5, jobId: ab806ce6e5b0dda5eae3139eb784ddb8).
2020-03-06 01:12:30,967 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Remove job ab806ce6e5b0dda5eae3139eb784ddb8 from job leader monitoring.
2020-03-06 01:12:30,967 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close JobManager connection for job ab806ce6e5b0dda5eae3139eb784ddb8.
2020-03-06 01:12:31,931 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close JobManager connection for job ab806ce6e5b0dda5eae3139eb784ddb8.
2020-03-06 01:12:31,931 INFO  org.apache.flink.runtime.taskexecutor.JobLeaderService        - Cannot reconnect to job ab806ce6e5b0dda5eae3139eb784ddb8 because it is not registered.
2020-03-06 01:13:07,706 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Close ResourceManager connection 1aae239700043eb7d1bf819836cea65a.
2020-03-06 01:13:07,706 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Connecting to ResourceManager akka.tcp://flink@bitpt417r01:36666/user/resourcemanager(00000000000000000000000000000000).
2020-03-06 01:13:07,722 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Resolved ResourceManager address, beginning registration
2020-03-06 01:13:07,722 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager attempt 1 (timeout=100ms)
2020-03-06 01:13:07,752 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Registration at ResourceManager was declined: unrecognized TaskExecutor
2020-03-06 01:13:07,753 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor            - Pausing and re-attempting registration in 30000 ms
2020-03-06 01:13:07,993 INFO  org.apache.flink.yarn.YarnTaskExecutorRunner                  - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2020-03-06 01:13:07,994 INFO  org.apache.flink.runtime.blob.TransientBlobCache              - Shutting down BLOB cache


Thanks