Checkpoint failure discussion

Checkpoint failure discussion

yanggang_it_job
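Recently, several of our jobs that use RocksDB as the state backend and HDFS as the remote file system have been failing frequently. The failures share the following characteristics:
1. The jobs had been running smoothly until the error suddenly appeared one day.
2. When the error shows up, multiple jobs fail their checkpoints with the same error.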


The error message is as follows:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/stream/flink-checkpoints/19523bf083346eb80b409167e9b91b53/chk-43396/cef72b90-8492-4b09-8d1b-384b0ebe5768 could only be replicated to 0 nodes instead of minReplication (=1). There are 8 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1723)


Could you please take a look?
Thanks

Re: Checkpoint failure discussion

Yun Tang
Hi

The error "could only be replicated to 0 nodes instead of minReplication (=1)" is caused by HDFS instability: HDFS was unable to replicate the data. It has nothing to do with Flink itself.

Best regards,
Yun Tang
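
For anyone who wants to check whether several failing jobs are hitting the same HDFS-side condition, here is a minimal sketch that parses the NameNode message quoted in this thread. The names `parse_replication_failure`, `ReplicationFailure`, and `all_live_datanodes_rejected` are hypothetical helpers, not part of Flink or Hadoop, and the `<job-id>/chk-<n>/` path layout is assumed from the example path above. When DataNodes are reported as running and none are excluded, yet zero replicas were placed, the usual suspects are on the HDFS side (for example DataNodes short on disk space or under heavy load), but that is a hint to investigate rather than a diagnosis.

```python
import re
from typing import NamedTuple, Optional

# Matches the NameNode error text quoted in this thread.
_ERROR = re.compile(
    r"File (?P<path>\S+) could only be replicated to (?P<replicated>\d+) nodes "
    r"instead of minReplication \(=(?P<min_replication>\d+)\)\. "
    r"There are (?P<datanodes>\d+) datanode\(s\) running and "
    r"(?P<excluded>no|\d+) node\(s\) are excluded in this operation\."
)
# Assumption: the checkpoint path looks like .../<job-id>/chk-<n>/<file>.
_CHECKPOINT_PATH = re.compile(r"/(?P<job_id>[0-9a-f]+)/chk-(?P<checkpoint_id>\d+)/")


class ReplicationFailure(NamedTuple):
    path: str
    job_id: Optional[str]
    checkpoint_id: Optional[int]
    replicated: int
    min_replication: int
    datanodes: int
    excluded: int


def parse_replication_failure(line: str) -> Optional[ReplicationFailure]:
    """Return the fields of the HDFS replication error, or None if absent."""
    m = _ERROR.search(line)
    if m is None:
        return None
    p = _CHECKPOINT_PATH.search(m["path"])
    return ReplicationFailure(
        path=m["path"],
        job_id=p["job_id"] if p else None,
        checkpoint_id=int(p["checkpoint_id"]) if p else None,
        replicated=int(m["replicated"]),
        min_replication=int(m["min_replication"]),
        datanodes=int(m["datanodes"]),
        excluded=0 if m["excluded"] == "no" else int(m["excluded"]),
    )


def all_live_datanodes_rejected(f: ReplicationFailure) -> bool:
    """True when DataNodes are running, none were excluded, yet no replica was placed."""
    return f.datanodes > 0 and f.excluded == 0 and f.replicated < f.min_replication


# The message from the original post.
SAMPLE = (
    "org.apache.hadoop.ipc.RemoteException(java.io.IOException): File "
    "/user/stream/flink-checkpoints/19523bf083346eb80b409167e9b91b53/chk-43396/"
    "cef72b90-8492-4b09-8d1b-384b0ebe5768 could only be replicated to 0 nodes "
    "instead of minReplication (=1). There are 8 datanode(s) running and "
    "no node(s) are excluded in this operation."
)

failure = parse_replication_failure(SAMPLE)
print(failure)
print(all_live_datanodes_rejected(failure))
```

Running this over the logs of all affected jobs and grouping the results by timestamp would show whether the failures cluster in the same time window, which is what one would expect if a shared HDFS cluster is the common cause.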

________________________________
From: yanggang_it_job <[hidden email]>
Sent: Monday, June 1, 2020 15:30
To: [hidden email] <[hidden email]>
Subject: Checkpoint failure discussion
