flink1.9 Blink SQL: primary key lost + low throughput when combining deduplication with a temporal table join


flink1.9 Blink SQL: primary key lost + low throughput when combining deduplication with a temporal table join

Dream-底限
hi,
I'm using Flink 1.9 Blink SQL for data transformation, but I ran into two problems:
1. The ROW_NUMBER function loses the primary key.
2. Combining ROW_NUMBER with a temporal table join drops throughput drastically. The SQL is:

-- In theory the DISTINCT is unnecessary, but Blink cannot derive a primary key
-- from this SQL, so validation fails without it
SELECT distinct t1.id as order_id, ..., DATE_FORMAT(t1.proctime, 'yyyy-MM-dd HH:mm:ss') as etl_time
FROM (
  select id, ..., proctime from (
    select data.index0.id, ..., proctime,
           ROW_NUMBER() OVER (PARTITION BY data.index0.id ORDER BY es desc) AS rowNum
    from installmentdb_t_line_item
  ) tmp where rowNum <= 1
) t1
left join SNAP_T_OPEN_PAY_ORDER FOR SYSTEM_TIME AS OF t1.proctime t2 on t2.LI_ID = t1.id
left join SNAP_T_SALES_ORDER FOR SYSTEM_TIME AS OF t1.proctime t4 ON t1.so_id = t4.ID

The SQL above has very low throughput, only a few records per second. Run separately, both halves perform acceptably: the temporal table join alone reaches several thousand records per second, and the ROW_NUMBER deduplication alone reaches tens of thousands, but I don't know why combining them drops it to just a few.

SELECT distinct t1.id as order_id, ..., DATE_FORMAT(t1.proctime, 'yyyy-MM-dd HH:mm:ss') as etl_time
FROM (
  select id, ..., proctime from (
    select data.index0.id, ..., proctime from installmentdb_t_line_item
  ) tmp
) t1
left join SNAP_T_OPEN_PAY_ORDER FOR SYSTEM_TIME AS OF t1.proctime t2 on t2.LI_ID = t1.id
left join SNAP_T_SALES_ORDER FOR SYSTEM_TIME AS OF t1.proctime t4 ON t1.so_id = t4.ID

SELECT distinct t1.id as order_id, ..., DATE_FORMAT(t1.proctime, 'yyyy-MM-dd HH:mm:ss') as etl_time
FROM (
  select id, ..., proctime from (
    select data.index0.id, ..., proctime,
           ROW_NUMBER() OVER (PARTITION BY data.index0.id ORDER BY es desc) AS rowNum
    from installmentdb_t_line_item
  ) tmp where rowNum <= 1
) t1
Re: flink1.9 Blink SQL: primary key lost + low throughput when combining deduplication with a temporal table join

Benchao Li
Hi,

Which state backend are you using? Your symptoms are very likely related to it. For example, with RocksDB on an ordinary disk it is easy to hit an IO bottleneck.
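For context, the state backend is typically switched in flink-conf.yaml; a minimal sketch (the HDFS path is a placeholder, key names per the Flink 1.9 configuration reference):

```yaml
# 'rocksdb' spills state to disk and suits very large state;
# 'filesystem' (FsStateBackend) keeps working state on the JVM heap
state.backend: rocksdb
# Where completed checkpoints are persisted (placeholder path)
state.checkpoints.dir: hdfs:///flink/checkpoints
# Incremental checkpoints reduce checkpoint IO when using RocksDB
state.backend.incremental: true
```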

宇张 <[hidden email]> wrote on Monday, May 11, 2020 at 11:14 AM:

> [snip]


--

Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel:+86-15650713730
Email: [hidden email]; [hidden email]
Re: flink1.9 Blink SQL: primary key lost + low throughput when combining deduplication with a temporal table join

Dream-底限
hi,
My state backend is FsStateBackend, with the state stored on HDFS.

On Mon, May 11, 2020 at 11:19 AM Benchao Li <[hidden email]> wrote:

> [snip]
Re: Re: flink1.9 Blink SQL: primary key lost + low throughput when combining deduplication with a temporal table join

刘大龙
Hi,

I'd suggest trying a few variants, e.g. join first and deduplicate afterwards, and measure the performance. Also, about the deduplication: your business logic orders by the es field to keep the latest row; could you deduplicate by proctime instead? I believe the rowNum <= 1 in your SQL actually generates a Rank operator, not a Deduplication operator. For order-type data I use the deduplication operator like this: select * from (select *, row_number() over(partition by id order by proctime desc) as rowNum from xxx) tmp where rowNum = 1 -- only this syntax is translated into the Deduplication operator.
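The syntactic distinction above can be sketched side by side (column and table names taken from the thread's queries; this is a sketch of the two shapes, not a tested plan comparison):

-- Ordering by an event field (es) with rowNum <= 1: planned as a Rank operator
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY es DESC) AS rowNum
  FROM installmentdb_t_line_item
) tmp WHERE rowNum <= 1;

-- Ordering by proctime with rowNum = 1: recognized as the Deduplication operator
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY proctime DESC) AS rowNum
  FROM installmentdb_t_line_item
) tmp WHERE rowNum = 1;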


> ----- Original message -----
> From: "宇张" <[hidden email]>
> Sent: 2020-05-11 11:40:37 (Monday)
> To: [hidden email]
> Cc:
> Subject: Re: flink1.9 Blink SQL: primary key lost + low throughput when combining deduplication with a temporal table join
>
> [snip]


------------------------------
刘大龙

Institute of Cyber-Systems and Control, Department of Control Science, Zhejiang University, Room 217, New Industrial Control Building
Address: Yuquan Campus, Zhejiang University, 38 Zheda Road, Hangzhou, Zhejiang
Tel:18867547281