I'm computing a website's DAU in Flink SQL by grouping on the date field and taking COUNT(DISTINCT user_id), with table.optimizer.distinct-agg.split.enabled=true turned on.
After the job starts, when I query the MySQL result table, the next second's result is sometimes smaller than the previous second's, although the overall trend is upward. Why does this happen?

Query time            Result
2020-10-10 19:00:00   100
2020-10-10 19:00:02   98
2020-10-10 19:00:04   102
2020-10-10 19:00:06   108
2020-10-10 19:00:08   106
2020-10-10 19:00:10   110

The SQL:

create table jdbc_sink(
    date_str varchar,
    dau bigint,
    PRIMARY KEY (date_str) NOT ENFORCED
) with (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://xxx',
  'table-name' = 'xxx',
  'driver' = 'com.mysql.jdbc.Driver',
  'username' = 'xxx',
  'password' = 'xxx'
);

CREATE TABLE action_log_source (
  user_id varchar,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) with (
  ...
);

INSERT INTO jdbc_sink
SELECT
  day_str AS date_str,
  COUNT(DISTINCT user_id) AS dau
FROM (
  SELECT
    user_id,
    DATE_FORMAT(event_time, 'yyyy-MM-dd') AS day_str
  FROM action_log_source
)
GROUP BY day_str
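For context, here is a minimal sketch (plain Python, no Flink APIs; the bucket count and user_ids are made up for illustration) of what table.optimizer.distinct-agg.split.enabled=true does conceptually: the distinct count is rewritten into per-bucket partial distinct counts plus a final SUM over buckets, so the hot key is spread across parallel tasks. In Flink the number of buckets is controlled by table.optimizer.distinct-agg.split.bucket-num (default 1024).

```python
from collections import defaultdict

NUM_BUCKETS = 4  # illustrative; Flink's default bucket-num is 1024

# user_ids seen for one day (made-up sample data)
events = ["u1", "u2", "u3", "u1", "u4", "u2"]

# Phase 1: distinct users per bucket (runs in parallel in Flink).
buckets = defaultdict(set)
for uid in events:
    buckets[hash(uid) % NUM_BUCKETS].add(uid)

# Phase 2: SUM of per-bucket distinct counts equals the overall COUNT(DISTINCT).
dau = sum(len(s) for s in buckets.values())
assert dau == len(set(events))  # both are 4 here
print(dau)
```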
Your data volume doesn't look large; consider also outputting the user_id details so you can see exactly which ones went missing.
熊云昆
The numbers above were constructed just to demonstrate the phenomenon, not the real volume; the real data volume is in the tens of millions.
------------------ Original message ------------------
From: "user-zh" <[hidden email]>
Date: Monday, Oct 19, 2020, 8:03 AM
To: "867127831" <[hidden email]>
Cc: "user-zh" <[hidden email]>
Subject: Re: flink sql count distinct result becomes smaller

> Your data volume doesn't look large; consider also outputting the user_id details so you can see exactly which ones went missing.
In reply to this post by 熊云昆
I've seen the value shrink before when retractions occurred — i.e., a failure happened partway through, and the job recovered afterwards.
On 2020-10-19 08:03:46, "熊云昆" <[hidden email]> wrote:
> Your data volume doesn't look large; consider also outputting the user_id details so you can see exactly which ones went missing.
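The retraction explanation fits the symptom. Here is a hypothetical simulation (plain Python, no Flink APIs; bucket sizes are made up) of how the split aggregation's final SUM can briefly go down: each bucket update arrives at the final aggregation as a retract of the old partial count followed by an insert of the new one, and with an upsert JDBC sink a query landing between the retract and the insert observes a smaller total.

```python
def split_distinct_updates(bucket_counts, bucket, new_count):
    """Yield the running total after each retract/insert message."""
    total = sum(bucket_counts.values())
    # Retract the bucket's old partial count first...
    total -= bucket_counts.get(bucket, 0)
    yield total                      # a read here sees a dip below the true DAU
    # ...then insert the bucket's new partial count.
    bucket_counts[bucket] = new_count
    total += new_count
    yield total

buckets = {0: 50, 1: 50}             # two buckets, 100 distinct users so far
observed = list(split_distinct_updates(buckets, 0, 52))
print(observed)                      # [50, 102] -- dips below 100, then rises
```

Eventually the insert catches up, which is why the trend is still upward overall.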