Dear all:
My Flink job keeps crashing, roughly once every 6 days, and after a long search I still cannot find the cause.

Scenario: read from Kafka and write to HDFS, about 100 GB per hour.
Environment: Kafka has Kerberos enabled; the cluster itself does not use Kerberos authentication.

Symptoms:
1. Checkpoints sometimes fail.
2. Right after installing Flink there was a problem: it could not consume from Kafka. I then added /etc/krb5.conf on every machine and configured the KDC domain name in each machine's hosts file, which fixed it. Unexpectedly, the job crashed 6 days later. I somewhat suspect an authentication problem, but I am not sure.
3. Below is the log from just before the crash. Could you please help me analyze the cause?

2020-03-06 01:11:27,027 INFO org.apache.hadoop.hdfs.DFSClient - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12673/b1992c81-7821-4d68-9c0b-1302b24c42d0 retrying...
2020-03-06 01:11:33,457 INFO org.apache.hadoop.hdfs.DFSClient - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12673/b1992c81-7821-4d68-9c0b-1302b24c42d0 retrying...
2020-03-06 01:11:56,869 INFO org.apache.hadoop.hdfs.DFSClient - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12674/58c158bf-0583-4c14-a6d4-2873d1b0d9ac retrying...
2020-03-06 01:12:03,270 INFO org.apache.hadoop.hdfs.DFSClient - Could not complete /user/shtermuser/holmes_hdfs/checkpoint/ab806ce6e5b0dda5eae3139eb784ddb8/chk-12674/58c158bf-0583-4c14-a6d4-2873d1b0d9ac retrying...
2020-03-06 01:12:18,283 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to cancel task Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0).
2020-03-06 01:12:18,283 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0) switched from RUNNING to CANCELING.
2020-03-06 01:12:18,283 INFO org.apache.flink.runtime.taskmanager.Task - Triggering cancellation of task code Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0).
2020-03-06 01:12:18,442 WARN org.apache.kafka.common.security.kerberos.KerberosLogin - [Principal=[hidden email]]: TGT renewal thread has been interrupted and will exit.
2020-03-06 01:12:18,443 INFO org.apache.flink.runtime.taskmanager.Task - Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0) switched from CANCELING to CANCELED.
2020-03-06 01:12:18,443 INFO org.apache.flink.runtime.taskmanager.Task - Freeing task resources for Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0).
2020-03-06 01:12:18,445 INFO org.apache.flink.runtime.taskmanager.Task - Ensuring all FileSystem streams are closed for task Source: Custom Source -> Sink: Unnamed (2/6) (90484955ec6ef7ff8d705142bbf6b5e0) [CANCELED]
2020-03-06 01:12:18,447 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state CANCELED to JobManager for task Source: Custom Source -> Sink: Unnamed 90484955ec6ef7ff8d705142bbf6b5e0.
2020-03-06 01:12:30,966 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=1935}, allocationId: 13e1e5aaa46415eb80b67fda822519f5, jobId: ab806ce6e5b0dda5eae3139eb784ddb8).
2020-03-06 01:12:30,967 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job ab806ce6e5b0dda5eae3139eb784ddb8 from job leader monitoring.
2020-03-06 01:12:30,967 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job ab806ce6e5b0dda5eae3139eb784ddb8.
2020-03-06 01:12:31,931 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job ab806ce6e5b0dda5eae3139eb784ddb8.
2020-03-06 01:12:31,931 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job ab806ce6e5b0dda5eae3139eb784ddb8 because it is not registered.
2020-03-06 01:13:07,706 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close ResourceManager connection 1aae239700043eb7d1bf819836cea65a.
2020-03-06 01:13:07,706 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@bitpt417r01:36666/user/resourcemanager(00000000000000000000000000000000).
2020-03-06 01:13:07,722 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Resolved ResourceManager address, beginning registration
2020-03-06 01:13:07,722 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager attempt 1 (timeout=100ms)
2020-03-06 01:13:07,752 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager was declined: unrecognized TaskExecutor
2020-03-06 01:13:07,753 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Pausing and re-attempting registration in 30000 ms
2020-03-06 01:13:07,993 INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2020-03-06 01:13:07,994 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache

Thanks