FlinkSQL 1.11.1 reading from Kafka and writing to Hive (Parquet): OOM problem


wangenbao
A question for the experts here: has anyone run into the following problem?

1. I first read PB (protobuf) format data from Kafka via the Table API, convert it to POJO objects, and register the result as a view;
2. I then INSERT INTO a Hive table with three partition levels (day, hour, hashtid), stored as Parquet with Snappy compression;
3. Once the data is spread across a relatively large number of partitions, an OOM occurs. Concretely, the logs first show

parquet.hadoop.MemoryManager: Total allocation exceeds 50.00% (2,102,394,880
bytes) of heap memory
Scaling row group sizes to 13.62% for 115 writers

followed shortly by java.lang.OutOfMemoryError: Java heap space.

My suspicion is that too many Parquet writers are open at once. Has anyone seen a similar problem, and how should it be solved?
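(The scaling factor in the log is consistent with this diagnosis: assuming Parquet's default 128 MiB row group size, 115 writers request 115 × 134,217,728 ≈ 15.4 GB, and 2,102,394,880 / 15,435,038,720 ≈ 13.62%.)

For context, a minimal sketch of the pipeline described above, in Java against the Flink 1.11 APIs. Record, PbDeserializer, the topic name, the Hive conf path, and the table name are all placeholders, not details from the original post:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.AbstractDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class KafkaPbToHiveJob {

    /** Placeholder POJO standing in for the poster's PB-derived type. */
    public static class Record {
        public String hashtid;
        public String dt;       // day partition value
        public String hr;       // hour partition value
        public String payload;
    }

    /** Placeholder deserializer; real protobuf parsing is elided. */
    public static class PbDeserializer extends AbstractDeserializationSchema<Record> {
        @Override
        public Record deserialize(byte[] message) {
            return new Record();
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Streaming filesystem/Hive sinks only commit files on checkpoints.
        env.enableCheckpointing(60_000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // The partitioned Parquet+Snappy table is assumed to already exist
        // in this Hive catalog, with partition-commit properties configured.
        tEnv.registerCatalog("hive", new HiveCatalog("hive", "default", "/etc/hive/conf"));

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");
        props.setProperty("group.id", "pb-to-hive");

        // 1. Read PB records from Kafka and register them as a view.
        DataStream<Record> records =
                env.addSource(new FlinkKafkaConsumer<>("pb-topic", new PbDeserializer(), props));
        tEnv.createTemporaryView("records", records);

        // 2. Dynamic-partition insert into the three-level partitioned table;
        //    partition columns (dt, hr, hashtid) come last in the SELECT.
        tEnv.executeSql(
                "INSERT INTO hive.db.target_table SELECT payload, dt, hr, hashtid FROM records");
    }
}
```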




Re: FlinkSQL 1.11.1 reading from Kafka and writing to Hive (Parquet): OOM problem

Jingsong Li
Could you consider keying the stream by hashtid (keyBy) before writing?
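A minimal sketch of this suggestion, reusing the placeholder `records` stream and `Record.hashtid` field from the sketch above:

```java
// Hash-partition the stream on hashtid before registering the view:
// all records for a given hashtid then land on the same subtask, so
// each parallel writer keeps far fewer Parquet row groups in memory.
tEnv.createTemporaryView("records", records.keyBy(r -> r.hashtid));
```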

Best,
Jingsong


Re: FlinkSQL 1.11.1 reading from Kafka and writing to Hive (Parquet): OOM problem

wangenbao
Thanks for the reply.
keyBy does help: it raises the parallelism and spreads the data across multiple TaskManagers, but I then ran into the problem shown in these screenshots:
<http://apache-flink.147419.n8.nabble.com/file/t959/QQ%E6%88%AA%E5%9B%BE20200916221935.png>
<http://apache-flink.147419.n8.nabble.com/file/t959/QQ%E6%88%AA%E5%9B%BE20200916222005.png>

Is there a way to directly control the parallelism of the INSERT statement?




Re: FlinkSQL 1.11.1 reading from Kafka and writing to Hive (Parquet): OOM problem

Jingsong Li
You probably mean controlling the parallelism of the sink; that has been under discussion for a while.
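In 1.11 there is no per-INSERT sink parallelism option (a per-connector sink.parallelism option was only added in later releases); a sketch of the nearest available knob, which sets the default parallelism for the whole SQL job including its sink:

```java
// Planner-wide default parallelism; Flink 1.11 SQL has no sink-only
// setting, so this applies to every operator of the job.
tEnv.getConfig().getConfiguration()
        .setInteger("table.exec.resource.default-parallelism", 8);
```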


--
Best, Jingsong Lee

Re: FlinkSQL 1.11.1 reading from Kafka and writing to Hive (Parquet): OOM problem

wangenbao
In reply to this post by Jingsong Li
The key to this problem should be what you replied in
http://apache-flink.147419.n8.nabble.com/StreamingFileWriter-td7161.html
namely that Flink 1.11.2 fixes a bug: https://issues.apache.org/jira/browse/FLINK-19121
I have now also set table.exec.hive.fallback-mapred-writer=false on my side.
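For reference, a sketch of setting that option programmatically; the same key can also be set through the SQL client's SET command:

```java
// Use Flink's native Parquet writer rather than falling back to Hive's
// mapred record writer; combined with the FLINK-19121 fix shipped in
// 1.11.2, this addresses the writer-induced heap pressure above.
tEnv.getConfig().getConfiguration()
        .setBoolean("table.exec.hive.fallback-mapred-writer", false);
```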


