Flink jobs with large state reading from disk: disk IO saturated and tasks interfering with each other


Flink jobs with large state reading from disk: disk IO saturated and tasks interfering with each other

jackjiang
Hi everyone:

As the subject says: our Flink jobs have large state that is read from disk, disk IO gets saturated, and tasks interfere with each other.

What we have tried:

1. Manually migrating high-IO tasks to other machines, but YARN places tasks fairly randomly, so this only works occasionally.

2. We have no SSDs, only ordinary SATA disks. We have added two disks to improve total disk IO capacity, but the single-disk IO bottleneck for a single task remains (a rough sketch of our setup is below).

What other strategies could solve or at least mitigate this?
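For reference, a minimal sketch of how the two disks are currently exposed to the RocksDB state backend; the class name, local paths, and checkpoint URI are placeholders for our actual setup:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MultiDiskSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        RocksDBStateBackend backend = new RocksDBStateBackend("hdfs:///flink/checkpoints");
        // Spread RocksDB working directories across both local disks
        // (also configurable via state.backend.rocksdb.localdir).
        backend.setDbStoragePaths("/disk1/flink/rocksdb", "/disk2/flink/rocksdb");
        env.setStateBackend(backend);
    }
}

As far as I understand, each state instance still lands on only one of the configured paths (chosen at random), which is why a single hot task can still saturate one disk.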

Regards,
JackJiang

Re: Flink jobs with large state reading from disk: disk IO saturated and tasks interfering with each other

Wesley Peng-3


On 2019/9/10 13:47, 蒋涛涛 wrote:
> What we have tried:
>
> 1. Manually migrating high-IO tasks to other machines, but YARN places tasks fairly randomly, so this only works occasionally.
>
> 2. We have no SSDs, only ordinary SATA disks. We have added two disks to improve total disk IO capacity, but the single-disk IO bottleneck for a single task remains.
>
> What other strategies could solve or at least mitigate this?

It seems the tricks for improving RocksDB's throughput might be helpful.

Since writes and reads mostly access recent data, the goal is to keep
that data in memory as much as possible without using up all the memory
on the server. The following parameters are worth tuning:

Block cache size: When uncompressed blocks are read from SSTables, they
are cached in memory. The amount of data that can be stored before
eviction policies apply is determined by the block cache size. The
bigger the better.

Write buffer size: How big a Memtable can get before it is frozen.
Generally, the bigger the better. The tradeoff is that a big write
buffer takes more memory and takes longer to flush to disk and to recover.

Write buffer number: How many Memtables to keep before flushing to an
SSTable. Generally, the bigger the better. Similarly, the tradeoff is
that too many write buffers take up more memory and take longer to flush to disk.

Minimum write buffers to merge: If the most recently written keys are
changed frequently, it is better to flush only the latest version to
the SSTable. This parameter controls how many Memtables RocksDB will
try to merge before flushing to an SSTable, and it should be less than
the write buffer number. A suggested value is 2: if the number is too
big, merging the buffers takes longer, and duplicate keys are less
likely to appear across that many buffers, so the benefit diminishes.

The list above is far from being exhaustive, but tuning them correctly
can have a big impact on performance. Please refer to RocksDB’s Tuning
Guide for more details on these parameters. Figuring out the optimal
combination of values for all of them is an art in itself.

please ref: https://klaviyo.tech/flinkperf-c7bd28acc67
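For illustration, a rough sketch of wiring these knobs into Flink 1.9's
RocksDB backend through an OptionsFactory; the class name and the
concrete sizes are assumptions for the example, not recommendations:

import org.apache.flink.contrib.streaming.state.OptionsFactory;
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.ColumnFamilyOptions;
import org.rocksdb.DBOptions;

public class TunedRocksDBOptions implements OptionsFactory {

    @Override
    public DBOptions createDBOptions(DBOptions currentOptions) {
        return currentOptions; // keep DB-level defaults
    }

    @Override
    public ColumnFamilyOptions createColumnOptions(ColumnFamilyOptions currentOptions) {
        return currentOptions
            // write buffer size: how big a Memtable may grow before it is frozen
            .setWriteBufferSize(64 * 1024 * 1024)    // 64 MB, illustrative
            // write buffer number: Memtables kept before flushing to an SSTable
            .setMaxWriteBufferNumber(4)
            // minimum write buffers to merge before a flush
            .setMinWriteBufferNumberToMerge(2)
            // block cache size: cache for uncompressed blocks read from SSTables
            .setTableFormatConfig(
                new BlockBasedTableConfig().setBlockCacheSize(256 * 1024 * 1024)); // 256 MB, illustrative
    }
}

The factory would then be registered with
rocksDbBackend.setOptions(new TunedRocksDBOptions()). Since you are on
spinning SATA disks, Flink's built-in PredefinedOptions.SPINNING_DISK_OPTIMIZED
(via setPredefinedOptions) may also be a reasonable baseline.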

regards.

Re: Flink jobs with large state reading from disk: disk IO saturated and tasks interfering with each other

Biao Liu
Hello,

Is this much IO expected? Especially since it is disk reads that are saturated.
Have you tried any tuning?
1. Application-level tuning, e.g. checking whether state is used appropriately
2. System-level tuning, e.g. incremental checkpoints [1]

1.
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/stream/state/checkpointing.html#state-backend-incremental
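A minimal sketch of enabling incremental checkpoints on the RocksDB
backend; the class name, checkpoint URI, and interval are placeholders:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IncrementalCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000L); // checkpoint every 60 s, illustrative
        // 'true' enables incremental checkpoints: each checkpoint uploads only
        // newly created SST files instead of a full snapshot of the state
        env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints", true));
        // ... define the job, then env.execute(...)
    }
}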

Thanks,
Biao /'bɪ.aʊ/



On Tue, 10 Sep 2019 at 14:39, Wesley Peng <[hidden email]> wrote:

> [quoted text trimmed; see Wesley Peng's message above]

Re: Flink jobs with large state reading from disk: disk IO saturated and tasks interfering with each other

Congxian Qiu
Hi

As you describe it, a single task still has an IO bottleneck on a single disk. Is this a single container? As the others have said, you first need to confirm that this much IO is expected. If it is, you can try increasing the block cache and memtable sizes so that more data stays in memory.

Also, which state type are you using? If it is ValueState or ListState, could it be replaced with MapState (see the sketch below)? You can also check the RocksDB logs to see whether there is anything that can be optimized.
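To illustrate the MapState point: a ValueState holding a whole map has
to deserialize the entire value on every access, while MapState lets
RocksDB read and write individual entries. A hypothetical sketch (names
and types are made up for illustration):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Instead of ValueState<Map<String, Long>>, where every read or update
// pulls the whole map from RocksDB, MapState touches only one entry.
public class CountPerKey extends RichFlatMapFunction<String, Long> {

    private transient MapState<String, Long> counts;

    @Override
    public void open(Configuration parameters) {
        counts = getRuntimeContext().getMapState(
            new MapStateDescriptor<>("counts", Types.STRING, Types.LONG));
    }

    @Override
    public void flatMap(String value, Collector<Long> out) throws Exception {
        Long current = counts.get(value);   // reads a single entry
        long updated = (current == null) ? 1L : current + 1L;
        counts.put(value, updated);         // writes a single entry
        out.collect(updated);
    }
}

Keyed state like this only works on a keyed stream, i.e. after a keyBy(...).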


Best,
Congxian


Biao Liu <[hidden email]> wrote on Mon, 23 Sep 2019 at 14:39:

> [quoted text trimmed; see Biao Liu's message above]
>