Flink Application mode on Native k8s with OSS as the backend: logs occasionally show errors

Flink Application mode on Native k8s with OSS as the backend: logs occasionally show errors

王 羽凡
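Version: Flink 1.12.0
Environment: Native Kubernetes
Mode: Application Mode

Description:
When Flink runs on k8s in Native Kubernetes Application mode with the filesystem state backend on OSS, the logs show failed requests to OSS.
When the code uses `source.setStartFromEarliest();`, the job consumes from the beginning after startup and runs normally. Once it catches up to the latest position, the errors below appear; they disappear after a while or after the job is restarted.
When the code uses `source.setStartFromLatest();`, the job starts consuming directly from the latest position and the errors do not appear.
Based on these observations, is there something wrong with my configuration or usage?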

Command:
./bin/flink run-application \
    --target kubernetes-application \
    -Dkubernetes.cluster-id=demo \
    -Dkubernetes.container.image=xx/xx/xx:2.0.16 \
    -Dstate.backend=filesystem \
    -Dstate.checkpoints.dir=oss://bucket/文件夹 \
    -Dfs.oss.endpoint=oss-cn-beijing-internal.aliyuncs.com \
    -Dfs.oss.accessKeyId=xx \
    -Dfs.oss.accessKeySecret=xx \
    local:///opt/flink/usrlib/my-flink-job.jar

Error log:
2021-03-03 02:53:46,133 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Committing offset 12701:1:-1:4 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 10:53:46 AM 2021-03-03 02:53:46,140 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Successfully committed offset 12701:1:-1:4 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 10:53:50 AM 2021-03-03 02:53:50,899 INFO org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss [] - [Server]Unable to execute HTTP request: Not Found
2021/3/3 10:53:50 AM [ErrorCode]: NoSuchKey
2021/3/3 10:53:50 AM [RequestId]: xx
2021/3/3 10:53:50 AM [HostId]: null
2021/3/3 10:53:50 AM 2021-03-03 02:53:50,904 INFO org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss [] - [Server]Unable to execute HTTP request: Not Found
2021/3/3 10:53:50 AM [ErrorCode]: NoSuchKey
2021/3/3 10:53:50 AM [RequestId]: xx
2021/3/3 10:53:50 AM [HostId]: null
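To see how often these entries occur, the multi-line OSS messages can be grouped and counted. The following is only a minimal Python sketch and not part of the job itself: the function names `parse_oss_errors` and `count_error_codes` are made up for illustration, and the sample lines are abbreviated from the excerpt above with the log viewer's timestamp prefix dropped.

```python
import re
from collections import Counter

# Logger name copied from the log excerpt above.
OSS_LOGGER = "org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss"
# Continuation lines look like "[ErrorCode]: NoSuchKey".
FIELD = re.compile(r"\[(ErrorCode|RequestId|HostId)\]:\s*(\S+)")


def parse_oss_errors(lines):
    """Group each OSS SDK message with the "[Field]: value" lines that follow it."""
    entries = []
    current = None
    for line in lines:
        if OSS_LOGGER in line:
            # A line from the OSS logger starts a new entry.
            current = {"message": line.split(" - ", 1)[-1].strip()}
            entries.append(current)
            continue
        match = FIELD.search(line)
        if match and current is not None:
            current[match.group(1)] = match.group(2)
        else:
            # Any unrelated line ends the current entry.
            current = None
    return entries


def count_error_codes(lines):
    """Count grouped entries per ErrorCode value."""
    return Counter(e.get("ErrorCode", "unknown") for e in parse_oss_errors(lines))


# Sample input, abbreviated from the excerpt above (hypothetical shortening).
sample = [
    "2021-03-03 02:53:46,140 INFO ...PulsarMetadataReader [] - Successfully committed offset",
    "2021-03-03 02:53:50,899 INFO " + OSS_LOGGER + " [] - [Server]Unable to execute HTTP request: Not Found",
    "[ErrorCode]: NoSuchKey",
    "[RequestId]: xx",
    "[HostId]: null",
    "2021-03-03 02:53:50,904 INFO " + OSS_LOGGER + " [] - [Server]Unable to execute HTTP request: Not Found",
    "[ErrorCode]: NoSuchKey",
    "[RequestId]: xx",
    "[HostId]: null",
]

print(count_error_codes(sample))
```

Because the field pattern is searched anywhere in the line, the same sketch should also work on lines that still carry the viewer's timestamp prefix.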
Normal TaskManager log after killing the process so that the pod restarts, or after waiting a while:
2021-03-03 03:18:21,602 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Successfully committed offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 11:18:26 AM 2021-03-03 03:18:26,573 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Committing offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 11:18:26 AM 2021-03-03 03:18:26,582 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Successfully committed offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 11:18:31 AM 2021-03-03 03:18:31,571 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Committing offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 11:18:31 AM 2021-03-03 03:18:31,580 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Successfully committed offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 11:18:36 AM 2021-03-03 03:18:36,633 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Committing offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
2021/3/3 11:18:36 AM 2021-03-03 03:18:36,642 INFO org.apache.flink.streaming.connectors.pulsar.internal.PulsarMetadataReader [] - Successfully committed offset 12716:7:-1:1 to topic TopicRange{topic=persistent://public/xx/xxxx, key-range=SerializableRange{range=[0, 65535]}}
Files in OSS:
<粘贴的图形-1.png>
chk-10880 directory:
<粘贴的图形-2.png>

Re: Flink Application mode on Native k8s with OSS as the backend: logs occasionally show errors

王 羽凡
2021-03-04 02:33:25,292 DEBUG org.apache.flink.runtime.rpc.akka.SupervisorActor            [] - Starting FencedAkkaRpcActor with name jobmanager_2.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,304 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting RPC endpoint for org.apache.flink.runtime.jobmaster.JobMaster at akka://flink/user/rpc/jobmanager_2 .
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,310 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Initializing job TransactionAndAccount (00000000000000000000000000000000).
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,323 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using restart back off time strategy FixedDelayRestartBackoffTimeStrategy(maxNumberRestartAttempts=2147483647, backoffTimeMS=1000) for TransactionAndAccount (00000000000000000000000000000000).
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,380 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Running initialization on master for job TransactionAndAccount (00000000000000000000000000000000).
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,380 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Successfully ran initialization on master in 0 ms.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,381 DEBUG org.apache.flink.runtime.jobmaster.JobMaster                 [] - Adding 2 vertices from job graph TransactionAndAccount (00000000000000000000000000000000).
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,381 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Attaching 2 topologically sorted vertices to existing job graph with 0 vertices and 0 intermediate results.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,389 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Connecting ExecutionJobVertex cbc357ccb763df2852fee8c4fc7d55f2 (Source: Custom Source -> format to json -> Filter -> process timestamp range -> Timestamps/Watermarks) to 0 predecessors.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,389 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Connecting ExecutionJobVertex 337adade1e207453ed3502e01d75fd03 (Window(TumblingEventTimeWindows(86400000), EventTimeTrigger, SumAggregator, PassThroughWindowFunction) -> Flat Map -> Sink: tidb) to 1 predecessors.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,389 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph       [] - Connecting input 0 of vertex 337adade1e207453ed3502e01d75fd03 (Window(TumblingEventTimeWindows(86400000), EventTimeTrigger, SumAggregator, PassThroughWindowFunction) -> Flat Map -> Sink: tidb) to intermediate result referenced via predecessor cbc357ccb763df2852fee8c4fc7d55f2 (Source: Custom Source -> format to json -> Filter -> process timestamp range -> Timestamps/Watermarks).
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,395 INFO  org.apache.flink.runtime.scheduler.adapter.DefaultExecutionTopology [] - Built 1 pipelined regions in 2 ms
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,396 DEBUG org.apache.flink.runtime.jobmaster.JobMaster                 [] - Successfully created execution graph from job graph TransactionAndAccount (00000000000000000000000000000000).
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,406 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using job/cluster config to configure application-defined state backend: File State Backend (checkpoints: 'oss://xx/backend', savepoints: 'null', asynchronous: TRUE, fileStateThreshold: 20480)
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,406 INFO  org.apache.flink.runtime.jobmaster.JobMaster                 [] - Using application-defined state backend: File State Backend (checkpoints: 'oss://xx/backend', savepoints: 'null', asynchronous: TRUE, fileStateThreshold: 20480)
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,419 INFO  org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss          [] - [Server]Unable to execute HTTP request: Not Found
2021/3/4 10:33:25 AM [ErrorCode]: NoSuchKey
2021/3/4 10:33:25 AM [RequestId]: 604046F58B49C830320A1A53
2021/3/4 10:33:25 AM [HostId]: null
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,432 INFO  org.apache.flink.fs.osshadoop.shaded.com.aliyun.oss          [] - [Server]Unable to execute HTTP request: Not Found
2021/3/4 10:33:25 AM [ErrorCode]: NoSuchKey
2021/3/4 10:33:25 AM [RequestId]: 604046F58B49C830322A1A53
2021/3/4 10:33:25 AM [HostId]: null
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,442 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Recovering checkpoints from KubernetesStateHandleStore{configMapName='demo-00000000000000000000000000000000-jobmanager-leader'}.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,448 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Found 1 checkpoints in KubernetesStateHandleStore{configMapName='demo-00000000000000000000000000000000-jobmanager-leader'}.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,449 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to fetch 1 checkpoints from storage.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,449 INFO  org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore [] - Trying to retrieve checkpoint 10167.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,483 DEBUG org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Status of the shared state registry of job 00000000000000000000000000000000 after restore: SharedStateRegistry{registeredStates={}}.
2021/3/4 10:33:25 AM 2021-03-04 02:33:25,483 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Restoring job 00000000000000000000000000000000 from Checkpoint 10167 @ 1614825175716 for 00000000000000000000000000000000 located at oss://xx/backend/00000000000000000000000000000000/chk-10167.

I checked the JobManager log, and the same NoSuchKey error appears there as well.
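As a side note on the restore line in the JobManager log, the number after `@` reads like a timestamp in epoch milliseconds. A minimal Python sketch to decode it follows; the helper name `parse_restore_line` is hypothetical, and it assumes that the log4j timestamps are in UTC, which the 8-hour offset from the log viewer's prefix suggests.

```python
import re
from datetime import datetime, timedelta, timezone

# Restore line copied from the JobManager log above, without the timestamp prefix.
line = (
    "Restoring job 00000000000000000000000000000000 from Checkpoint 10167 "
    "@ 1614825175716 for 00000000000000000000000000000000 located at "
    "oss://xx/backend/00000000000000000000000000000000/chk-10167."
)

# Captures the checkpoint ID, the number after "@", and the storage path.
PATTERN = re.compile(r"Checkpoint (\d+) @ (\d+) .* located at (\S+?)\.?$")
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_restore_line(text):
    """Return (checkpoint_id, taken_at_utc, path), or None if the line does not match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Assumption: the number after "@" is milliseconds since the Unix epoch.
    taken_at = EPOCH + timedelta(milliseconds=int(m.group(2)))
    return int(m.group(1)), taken_at, m.group(3)


checkpoint_id, taken_at, path = parse_restore_line(line)

# Assumption: the log4j timestamp "2021-03-04 02:33:25,483" is UTC.
logged_at = datetime(2021, 3, 4, 2, 33, 25, 483000, tzinfo=timezone.utc)

print(checkpoint_id, taken_at.isoformat(), path)
print("restore logged", logged_at - taken_at, "after the checkpoint timestamp")
```

Under these assumptions the value decodes to 2021-03-04 02:32:55.716 UTC, roughly 30 seconds before the restore line was logged, so checkpoint 10167 appears to have been taken shortly before the restart.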


On March 3, 2021, at 11:23 AM, 王 羽凡 <[hidden email]> wrote:

Re: Flink Application mode on Native k8s with OSS as the backend: logs occasionally show errors

seuzxc
Did you manage to solve this problem? I am seeing the same error message.



--
Sent from: http://apache-flink.147419.n8.nabble.com/