[DISCUSS] FLIP-115: Filesystem connector in Table


[DISCUSS] FLIP-115: Filesystem connector in Table

Jingsong Li
Hi everyone,

I'd like to start a discussion about FLIP-115: Filesystem connector in Table
[1].
This FLIP will:
- Introduce a filesystem table factory in Table, supporting the
csv/parquet/orc/json/avro formats (a rough DDL sketch follows this list).
- Introduce a streaming filesystem/Hive sink in Table.
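
For illustration only, here is a minimal sketch of declaring such a
filesystem table and writing into it from the Table API. The WITH option
keys follow the current descriptor style and may end up different in the
FLIP; the table names and path are placeholders:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.java.StreamTableEnvironment;

    public class FilesystemTableSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env, settings);

            // Small in-memory source, registered only to make the sketch self-contained.
            DataStream<Tuple2<String, Long>> data =
                env.fromElements(Tuple2.of("alice", 1L), Tuple2.of("bob", 2L));
            tEnv.createTemporaryView("some_source", data, "user_id, cnt");

            // Hypothetical filesystem sink table; the option keys are placeholders.
            tEnv.sqlUpdate(
                "CREATE TABLE fs_sink (" +
                "  user_id STRING," +
                "  cnt BIGINT" +
                ") WITH (" +
                "  'connector.type' = 'filesystem'," +
                "  'connector.path' = 'hdfs:///tmp/fs_sink'," +
                "  'format.type' = 'csv'" +
                ")");

            tEnv.sqlUpdate("INSERT INTO fs_sink SELECT user_id, cnt FROM some_source");
            tEnv.execute("filesystem sink sketch");
        }
    }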

CC'ing the user mailing list: if you have any unmet needs, please feel free
to reply.

Looking forward to hearing from you.

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-115%3A+Filesystem+connector+in+Table

Best,
Jingsong Lee

Re: [DISCUSS] FLIP-115: Filesystem connector in Table

wang jinhai
Thanks for FLIP-115. It is a really useful feature for platform developers
who manage hundreds of Flink-to-Hive jobs in production.
I think we also need a 'connector.sink.username' option so that the proper
UserGroupInformation can be used when data is written to HDFS (a rough
sketch of the idea follows).
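
To make the suggestion concrete, here is a minimal sketch of what the sink
could do internally with such an option. 'connector.sink.username' is the
proposed (not yet existing) key, and impersonation via a Hadoop proxy user
is just one possible way to apply it:

    import java.security.PrivilegedExceptionAction;
    import org.apache.hadoop.security.UserGroupInformation;

    public final class SinkUserContext {

        /** Runs the given write action as the configured sink user, if one is set. */
        public static <T> T runAsSinkUser(
                String sinkUsername, PrivilegedExceptionAction<T> writeAction) throws Exception {
            if (sinkUsername == null || sinkUsername.isEmpty()) {
                // No user configured: write as the current login user.
                return writeAction.run();
            }
            // Impersonate the configured user on top of the real (login) user.
            UserGroupInformation proxyUser =
                UserGroupInformation.createProxyUser(sinkUsername, UserGroupInformation.getLoginUser());
            return proxyUser.doAs(writeAction);
        }
    }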

Re: [DISCUSS] FLIP-115: Filesystem connector in Table

Yun Gao
Hi,
Many thanks to Jingsong for bringing up this discussion! Enhancing the
filesystem connector in Table should largely improve usability.

I have the same question as Piotr. From my side, I think it would be better
to reuse the existing StreamingFileSink. We have already begun enhancing the
supported file formats (e.g., ORC, Avro, ...), and reusing StreamingFileSink
should avoid repeating that work in the Table library. Besides, the bucket
concept also seems to match the semantics of partitions (see the sketch
below).
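
To illustrate that analogy only (this is not part of the FLIP), a custom
BucketAssigner can already derive a Hive-style partition directory from the
record; the Tuple2 record type and the "dt" field name are placeholders:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.core.io.SimpleVersionedSerializer;
    import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
    import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

    /** Maps each record to a bucket id that looks like a Hive partition directory. */
    public class PartitionStyleBucketAssigner implements BucketAssigner<Tuple2<String, String>, String> {

        @Override
        public String getBucketId(Tuple2<String, String> element, Context context) {
            // element.f0 carries the partition value, e.g. "2020-03-13";
            // files for it then land under <basePath>/dt=2020-03-13/.
            return "dt=" + element.f0;
        }

        @Override
        public SimpleVersionedSerializer<String> getSerializer() {
            return SimpleVersionedStringSerializer.INSTANCE;
        }
    }

It would be plugged in via StreamingFileSink.forRowFormat(...) or
forBulkFormat(...) with withBucketAssigner(new PartitionStyleBucketAssigner()),
so bucket directories line up with partition directories.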

For the notification of adding partitions, I wonder whether the watermark
mechanism is enough, since a bucket/partition might span multiple subtasks.
It depends on the level of notification: if we want to notify per bucket on
each subtask, using watermarks to notify each subtask should be fine, but if
we want to notify for a whole bucket/partition, we might also need some
coordination between subtasks.


Best,
Yun




------------------------------------------------------------------
From: Piotr Nowojski <[hidden email]>
Send Time: 2020 Mar. 13 (Fri.) 18:03
To: dev <[hidden email]>
Cc: user <[hidden email]>; user-zh <[hidden email]>
Subject: Re: [DISCUSS] FLIP-115: Filesystem connector in Table

Hi,

Which actual sinks/sources are you planning to use in this feature? Is it about exposing StreamingFileSink in the Table API? Or do you want to implement new Sinks/Sources?

Piotrek



Re: [DISCUSS] FLIP-115: Filesystem connector in Table

Jingsong Li
Thanks Piotr and Yun for chiming in.

Hi Piotr and Yun, regarding the implementation:

FLINK-14254 [1] introduces the batch sink to the table world; it deals with
partitions, the metastore, and so on, and it simply reuses the
DataSet/DataStream FileInputFormat and FileOutputFormat. The filesystem
connector cannot do without FileInputFormat, because it has to handle
file-level concerns such as split generation. ORC and Parquet, for example,
need to read whole files and have different split logic.

So, back to the filesystem connector:
- It needs to introduce FilesystemTableFactory, FilesystemTableSource and
FilesystemTableSink.
- For sources, we reuse the DataSet/DataStream FileInputFormats; there is no
other interface for reading files (a minimal illustration follows this list).
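
Just to illustrate what reusing a FileInputFormat means on the DataStream
side (a simplified sketch, not the proposed source implementation; the path
is a placeholder and TextInputFormat stands in for the ORC/Parquet formats):

    import org.apache.flink.api.java.io.TextInputFormat;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FileInputFormatSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            String path = "hdfs:///tmp/input";
            // The FileInputFormat implementation owns file enumeration and split logic.
            TextInputFormat format = new TextInputFormat(new Path(path));
            DataStream<String> lines = env.readFile(format, path);

            lines.print();
            env.execute("file input format sketch");
        }
    }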

For file sinks:
- The batch sink uses FLINK-14254.
- The streaming sink could go one of two ways.

The first way is to reuse the batch sink from FLINK-14254, which already
handles the partition and metastore logic well:
- It unifies batch and streaming.
- Using FileOutputFormat is consistent with FileInputFormat.
- Only the exactly-once related logic needs to be added, roughly 200+ lines
of code.
- It is natural to support more table features, such as partition commit,
auto compaction, etc.

The second way is to reuse the DataStream StreamingFileSink:
- It unifies the streaming sink between Table and DataStream.
- But it may be hard to introduce table-related features into
StreamingFileSink.

I slightly prefer the first way. What do you think?

Hi Yun,

> The watermark mechanism might not be enough.

The watermarks of the subtasks are the same in "snapshotState" (a small
sketch of recording the per-subtask watermark at snapshot time follows).

> we might need to also do some coordination between subtasks.

Yes, the JobMaster is the role that coordinates the subtasks. The metastore
is a very fragile single point that cannot be accessed in a distributed way,
so it is accessed uniformly by the JobMaster.
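
Purely to illustrate the mechanism under discussion (each subtask recording
the watermark it has seen at snapshot time), a minimal sketch; this is not
the FLIP's actual partition-commit protocol:

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.ProcessFunction;
    import org.apache.flink.util.Collector;

    /** Tracks the subtask's current watermark and snapshots it on every checkpoint. */
    public class WatermarkSnapshotFunction<T> extends ProcessFunction<T, T>
            implements CheckpointedFunction {

        private long currentWatermark = Long.MIN_VALUE;
        private transient ListState<Long> watermarkState;

        @Override
        public void processElement(T value, Context ctx, Collector<T> out) {
            // The watermark this subtask has observed so far.
            currentWatermark = ctx.timerService().currentWatermark();
            out.collect(value);
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            // Record the watermark at checkpoint time; a committer could use it to
            // decide which partitions are complete.
            watermarkState.clear();
            watermarkState.add(currentWatermark);
        }

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            watermarkState = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("watermark", Types.LONG));
        }
    }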

[1] https://issues.apache.org/jira/browse/FLINK-14254

Best,
Jingsong Lee
