flink sql count distinct result sometimes gets smaller

4 messages

flink sql count distinct result sometimes gets smaller

Jun He
In Flink SQL I compute a website's DAU: I group by the date field and take COUNT(DISTINCT user_id), with table.optimizer.distinct-agg.split.enabled=true turned on.
After the job starts, when I query the MySQL result table, the value at one moment is sometimes smaller than the value a moment earlier, although the overall trend is upward. Why does this happen?


      Query time            Query result
2020-10-10 19:00:00   100
2020-10-10 19:00:02   98
2020-10-10 19:00:04   102
2020-10-10 19:00:06   108
2020-10-10 19:00:08   106
2020-10-10 19:00:10   110


The SQL is as follows:
create table jdbc_sink(
    date_str varchar,
    dau bigint,
    PRIMARY KEY (date_str) NOT ENFORCED
) with (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://xxx',
  'table-name' = 'xxx',
  'driver' = 'com.mysql.jdbc.Driver',
  'username' = 'xxx',
  'password' = 'xxx'
);


CREATE TABLE action_log_source (
  user_id varchar,
  event_time TIMESTAMP(3),
  WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) with (
  ...
);


INSERT INTO
  jdbc_sink
SELECT
  day_str as date_str,
  COUNT(DISTINCT user_id) AS dau
FROM (
  select 
      user_id as user_id,
      date_format(event_time, 'yyyy-MM-dd') as day_str
  from action_log_source
)
GROUP BY day_str
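For context on what the optimizer option does: table.optimizer.distinct-agg.split.enabled rewrites the COUNT DISTINCT into a two-level aggregation that first counts distinct users per hash bucket and then sums the partial counts per day. A minimal Python sketch of that idea (the bucket count and hash function here are illustrative, not Flink's actual internals):

```python
# Sketch of the two-level rewrite performed by
# table.optimizer.distinct-agg.split.enabled, in plain Python.
# Bucket count (4) and hash function are illustrative only.

def split_count_distinct(events, buckets=4):
    """events: iterable of (day_str, user_id) pairs."""
    # Level 1: partial COUNT(DISTINCT user_id) keyed by (day, bucket).
    partial = {}  # (day, bucket) -> set of user_ids
    for day, user in events:
        partial.setdefault((day, hash(user) % buckets), set()).add(user)
    # Level 2: SUM the partial distinct counts per day. Each user falls
    # into exactly one bucket, so the sum equals the true distinct count.
    total = {}
    for (day, _bucket), users in partial.items():
        total[day] = total.get(day, 0) + len(users)
    return total

events = [("2020-10-10", "u%d" % i) for i in range(100)]
print(split_count_distinct(events))  # {'2020-10-10': 100}
```

The split spreads the distinct-aggregation state across bucket keys to relieve hot-key skew, at the cost of a second aggregation stage that is updated incrementally via retractions.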

Re: flink sql count distinct result sometimes gets smaller

熊云昆
Your data volume doesn't look large; consider outputting the user_id details so you can check exactly which ones went missing.



熊云昆
Email: [hidden email]

On 2020-10-17 16:24, 867127831 wrote:

Re: flink sql count distinct result sometimes gets smaller

Jun He
The numbers above were constructed just to demonstrate the phenomenon; they are not the real volume. The real data is on the order of tens of millions of records.




------------------ Original message ------------------
From: "user-zh" <[hidden email]>;
Sent: Monday, 2020-10-19, 8:03 AM
To: "867127831" <[hidden email]>;
Cc: "user-zh" <[hidden email]>;
Subject: Re: flink sql count distinct result sometimes gets smaller




Re: flink sql count distinct result sometimes gets smaller

Michael Ran
In reply to this post by 熊云昆
I've seen the value get smaller when retractions occurred, i.e., something failed mid-way and then recovered afterwards.
On 2020-10-19 08:03:46, "熊云昆" <[hidden email]> wrote:

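One way the retraction behaviour mentioned above can surface: in a retract stream, an update to a partial aggregate arrives downstream as retract(old) followed by accumulate(new), and a sink that materializes every change will briefly show the value with the old contribution removed but the new one not yet added. A minimal sketch (the changelog encoding here is simplified, not Flink's actual runtime):

```python
# Minimal sketch of why a retract stream makes the sink value dip:
# an update is encoded as retract(old) followed by accumulate(new),
# and the sink materializes each change record individually.

def apply_changelog(changes):
    """Yield the sink's visible value after each change record."""
    value = 0
    for op, delta in changes:
        value += delta if op == "+" else -delta
        yield value

# A partial count is updated from 10 to 11: the stream carries -10 then +11.
changes = [("+", 100), ("-", 10), ("+", 11)]
observed = list(apply_changelog(changes))
print(observed)  # [100, 90, 101] - a query at the middle moment sees 90
```

This matches the symptom in the thread: the MySQL table is eventually correct and trends upward, but a read that lands between a retraction and the following accumulation observes a temporarily smaller count.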