flink-1.11.2 rocksdb when trigger savepoint job fail and restart


flink-1.11.2 rocksdb when trigger savepoint job fail and restart

smailxie






I have a SQL job running a dual-stream join, with state retained for 12 to 24 hours. Checkpoints complete normally and the state size ranges from 300 MB to 4 GB. But when I manually trigger a savepoint, the container is killed for exceeding its memory limit (the requested memory is 5 GB for a single slot).

My question is: when RocksDB takes a savepoint, does it read all of the state on disk into memory and then write a full snapshot?

If so, is this optimized in later versions? Otherwise, whenever the state on disk exceeds managed memory, the job is killed as soon as a savepoint is triggered manually.
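For reference, the savepoint is triggered manually via the Flink CLI, roughly like the following (the job id, target directory, and YARN application id are placeholders):

./bin/flink savepoint <jobId> hdfs:///flink/savepoints -yid <yarnApplicationId>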

The error output is below.

 

2020-12-10 09:18:50

java.lang.Exception: Container [pid=33290,containerID=container_e47_1594105654926_6890682_01_000002] is running beyond physical memory limits. Current usage: 5.1 GB of 5 GB physical memory used; 7.4 GB of 10.5 GB virtual memory used. Killing container.

Dump of the process-tree for container_e47_1594105654926_6890682_01_000002 :

      |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE

      |- 33334 33290 33290 33290 (java) 787940 76501 7842340864 1337121 /usr/java/default/bin/java -Xmx1234802980 -Xms1234802980 -XX:MaxDirectMemorySize=590558009 -XX:MaxMetaspaceSize=268435456 -Dlog.file=/data6/yarn/container-logs/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000002/taskmanager.log -Dlog4j.configuration=file:./log4j.properties -Dlog4j.configurationFile=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=456340281b -D taskmanager.memory.network.min=456340281b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=2738041755b -D taskmanager.cpu.cores=5.0 -D taskmanager.memory.task.heap.size=1100585252b -D taskmanager.memory.task.off-heap.size=0b --configDir . -Djobmanager.rpc.address=shd177.yonghui.cn -Dpipeline.classpaths= -Dweb.port=0 -Dexecution.target=embedded -Dweb.tmpdir=/tmp/flink-web-c415ad8e-c019-4398-869d-7c9e540c2479 -Djobmanager.rpc.port=44058 -Dpipeline.jars=file:/data1/yarn/nm/usercache/xiebo/appcache/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000001/yh-datacenter-platform-flink-1.0.0.jar -Drest.address=shd177.yonghui.cn -Dsecurity.kerberos.login.keytab=/data1/yarn/nm/usercache/xiebo/appcache/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000001/krb5.keytab

      |- 33290 33288 33290 33290 (bash) 0 0 108679168 318 /bin/bash -c /usr/java/default/bin/java -Xmx1234802980 -Xms1234802980 -XX:MaxDirectMemorySize=590558009 -XX:MaxMetaspaceSize=268435456 -Dlog.file=/data6/yarn/container-logs/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000002/taskmanager.log -Dlog4j.configuration=file:./log4j.properties -Dlog4j.configurationFile=file:./log4j.properties org.apache.flink.yarn.YarnTaskExecutorRunner -D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=456340281b -D taskmanager.memory.network.min=456340281b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=2738041755b -D taskmanager.cpu.cores=5.0 -D taskmanager.memory.task.heap.size=1100585252b -D taskmanager.memory.task.off-heap.size=0b --configDir . -Djobmanager.rpc.address='shd177.yonghui.cn' -Dpipeline.classpaths='' -Dweb.port='0' -Dexecution.target='embedded' -Dweb.tmpdir='/tmp/flink-web-c415ad8e-c019-4398-869d-7c9e540c2479' -Djobmanager.rpc.port='44058' -Dpipeline.jars='file:/data1/yarn/nm/usercache/xiebo/appcache/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000001/yh-datacenter-platform-flink-1.0.0.jar' -Drest.address='shd177.yonghui.cn' -Dsecurity.kerberos.login.keytab='/data1/yarn/nm/usercache/xiebo/appcache/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000001/krb5.keytab' 1> /data6/yarn/container-logs/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000002/taskmanager.out 2> /data6/yarn/container-logs/application_1594105654926_6890682/container_e47_1594105654926_6890682_01_000002/taskmanager.err

 

Container killed on request. Exit code is 143

Container exited with a non-zero exit code 143

 

Sent from Mail for Windows 10

 







--

Name:谢波
Mobile:13764228893

Re: flink-1.11.2 rocksdb when trigger savepoint job fail and restart

Yun Tang
Hi

Your TaskManager is running with a single slot, right? And how much managed memory does it have? When the RocksDB state backend executes a savepoint, it uses an iterator to traverse the data, so there is additional memory overhead (and that overhead is not part of what the write buffer and block cache manage). That said, RocksDB's iterator is a multi-level min-heap iterator, so in theory the temporary memory it occupies should not be very large. Could you reduce your program to a minimal demo that reliably reproduces the problem, so that we can debug it?
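For illustration only, here is a standalone RocksJava sketch of the kind of full traversal described above (this is not Flink's actual snapshot code, and the database path is a placeholder):

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.RocksIterator;

public class FullScanSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-scan-demo");
             RocksIterator it = db.newIterator()) {
            long bytes = 0;
            // A full snapshot walks every key-value pair through an iterator
            // like this; the iterator's working memory is the extra overhead
            // mentioned above, outside the write buffer / block cache budgets.
            for (it.seekToFirst(); it.isValid(); it.next()) {
                bytes += it.key().length + it.value().length;
            }
            System.out.println("bytes visited: " + bytes);
        }
    }
}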

As for working around the problem, you can consider increasing the JVM overhead [1] to enlarge that part of the buffer space.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12//deployment/memory/mem_setup.html
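In flink-conf.yaml that might look roughly like the following (the values are only illustrative and need to be tuned to your container size; the defaults are fraction 0.1, min 192mb, max 1gb):

taskmanager.memory.jvm-overhead.fraction: 0.2
taskmanager.memory.jvm-overhead.min: 512mb
taskmanager.memory.jvm-overhead.max: 2gb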

Best regards,
Yun Tang
________________________________
From: smailxie <[hidden email]>
Sent: Thursday, December 10, 2020 17:42
To: [hidden email] <[hidden email]>
Subject: flink-1.11.2 rocksdb when trigger savepoint job fail and restart