Hi Yubin,

Thanks for driving this discussion. Perhaps a specific example can better
illustrate the current issue.

Consider the following DDL: f0 will always be generated with the default
char length of 100, regardless of CHAR(5), because the connector option
'fields.f0.length' is not specified [1].

> CREATE TABLE foo (
>    f0 CHAR(5)
> ) WITH ('connector' = 'datagen');
>
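
One way to observe this, as a minimal sketch assuming the table above has
been created in the Flink SQL client (the alias `len` is only illustrative),
is to select the length of the generated values:

```sql
-- Show a few generated rows together with their actual character length.
-- CHAR_LENGTH is a built-in Flink SQL string function.
SELECT f0, CHAR_LENGTH(f0) AS len
FROM foo
LIMIT 5;
```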

Since a fixed-length type usually has its length specified explicitly in
the DDL, the current behavior can be confusing for users.

However, for the proposed changes, it would be preferable to provide
specific details on how to handle the "not be user-defined" scenario. For
example, should it be ignored or should an exception be thrown?

To be more specific,
1. For fixed-length data types, what happens with the following two DDLs?

> CREATE TABLE foo (
>    f0 CHAR(5)
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>
> CREATE TABLE bar (
>    f0 CHAR(5)
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '1');
>

2. For variable-length data types, what happens with the following two DDLs?

> CREATE TABLE meow (
>    f0 VARCHAR(20)
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>
> CREATE TABLE purr (
>    f0 STRING
> ) WITH ('connector' = 'datagen', 'fields.f0.length' = '10');
>

Best,
Jane

[1]
https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/datagen/#fields-length


On Mon, Nov 20, 2023 at 8:46 PM 李宇彬 <lixin58...@163.com> wrote:

> Hi everyone,
>
>
> Currently, the Datagen connector generates data that doesn't match the
> schema definition
> when dealing with fixed-length and variable-length fields. It defaults to
> a unified length of 100
> and requires manual configuration by the user. This violates the
> correctness of schema constraints
> and hampers ease of use.
>
>
> Jane Chan and I have discussed offline and I will summarize our discussion
> below.
>
>
> To enhance the datagen connector to automatically generate data that
> conforms to the schema
> definition without additional manual configuration, we propose handling
> the following data types
> appropriately [1]:
>       1. For fixed-length data types (char, binary), the length should be
> defined by the schema definition
>          and not be user-defined.
>       2. For variable-length data types (varchar, varbinary), the length
> should be defined by the schema
>           definition, but allow for user-defined lengths that are smaller
> than the schema definition.
>
>
>
> Looking forward to your feedback :)
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-32993
>
>
> Best,
> Yubin
>
>
