How to enable RocksDB native metrics?

2024-04-06 Thread Lei Wang
I am using large state and want to do some performance tuning. How can I
enable RocksDB native metrics?

I am using Flink 1.14.4.

Thanks,
Lei
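
For reference, RocksDB's native metrics are disabled by default and are
switched on per metric via the state.backend.rocksdb.metrics.* options. A
minimal sketch, assuming the option keys documented for Flink 1.14 (the same
keys can go in flink-conf.yaml instead):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    // Enable a few RocksDB native metrics; each one is off by default
    // and must be turned on individually.
    Configuration conf = new Configuration();
    conf.setString("state.backend.rocksdb.metrics.estimate-num-keys", "true");
    conf.setString("state.backend.rocksdb.metrics.block-cache-usage", "true");
    conf.setString("state.backend.rocksdb.metrics.cur-size-all-mem-tables", "true");
    conf.setString("state.backend.rocksdb.metrics.num-running-compactions", "true");

    StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment(conf);

Note that enabling native metrics adds some overhead, so they are usually
turned on only while tuning.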


Re: Combining multiple stages into a multi-stage processing pipeline

2024-04-06 Thread Yunfeng Zhou
Hi Mark,

IMHO, your design of the Flink application is generally feasible. In
Flink ML, I once encountered a similar design in the ChiSqTest operator,
where the input data is first aggregated to produce some results, which
are then broadcast and connected with other result streams derived from
the same input. You may refer to this algorithm for more details when
designing your application.
https://github.com/apache/flink-ml/blob/master/flink-ml-lib/src/main/java/org/apache/flink/ml/stats/chisqtest/ChiSqTest.java

Besides, side outputs are typically used when you want to split an
output stream into different categories. Given that the
ProcessWindowFn before each SideOutput-x has only one downstream
consumer, it would be enough to pass the resulting DataStream directly
to the session windows instead of introducing side outputs, as in the
sketch below.
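
A minimal sketch of that simplification (types and function names such as
Evaluated, EvalWindowFn and SessionEvalFn are hypothetical placeholders,
not from your code):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // The tumbling-window evaluation has a single consumer, so its result
    // stream can feed the session window directly; no OutputTag needed.
    DataStream<Evaluated> perKey = input
            .keyBy(r -> r.getKey())
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .process(new EvalWindowFn());

    perKey
            .windowAll(EventTimeSessionWindows.withGap(Time.seconds(30)))
            .process(new SessionEvalFn())
            .sinkTo(kafkaSink);   // a KafkaSink built elsewhere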

Best,
Yunfeng

On Sun, Apr 7, 2024 at 12:41 AM Mark Petronic  wrote:
>
> I am looking for some design advice for a new Flink application and I am
> relatively new to Flink - I have one, fairly straightforward Flink
> application in production so far. [...]
>


Re: Why doesn't the HBase SQL connector support ARRAY/MAP/ROW types?

2024-04-06 Thread Yunfeng Zhou
This is likely because these complex collection types have no directly
corresponding data type in HBase, so Flink SQL does not support them directly.

One approach is to convert these types to a string in some format (e.g. JSON),
or serialize them to a byte array, store that in HBase, and parse/deserialize
it again when reading.
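
A minimal sketch of that idea as a scalar UDF (it assumes a Jackson
dependency and a simple MAP<STRING, STRING> column; the class name is
hypothetical):

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.flink.table.functions.ScalarFunction;
    import java.util.Map;

    // Serializes a MAP column to a JSON string so it fits into a plain
    // HBase STRING qualifier; a mirror UDF would parse it back on read.
    public class MapToJson extends ScalarFunction {
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public String eval(Map<String, String> value) {
            try {
                return value == null ? null : MAPPER.writeValueAsString(value);
            } catch (Exception e) {
                throw new RuntimeException("Failed to serialize map", e);
            }
        }
    }

Once registered with createTemporarySystemFunction, it can be applied in the
INSERT INTO ... SELECT that feeds the HBase sink.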

On Mon, Apr 1, 2024 at 7:38 PM 王广邦  wrote:
>
> Why doesn't the HBase SQL connector (flink-connector-hbase_2.11) support the
> data types ARRAY, MAP / MULTISET, and ROW?
> https://nightlies.apache.org/flink/flink-docs-release-1.11/zh/dev/table/connectors/hbase.html
> Also, what is the recommended way to handle these three types?
>
>
>
>
> Sent from my iPhone


Combining multiple stages into a multi-stage processing pipeline

2024-04-06 Thread Mark Petronic
I am looking for some design advice for a new Flink application and I am
relatively new to Flink - I have one, fairly straightforward Flink
application in production so far.

For this new application, I want to create a three-stage processing
pipeline. Functionally, I see this as ONE long datastream, but I have to
evaluate the STAGE-1 data in a special manner and then pass that
evaluation on to STAGE-2, which uses the STAGE-1 results to shape its own
evaluation. The same thing happens again in STAGE-3, using the STAGE-2
evaluation results. Finally, the end result is published to Kafka. The
stages functionally look like this:

STAGE-1
KafkaSource |=> Keyby => TumblingWindows1 => ProcessWindowFn => SideOutput-1
            |=> WindowAll => ProcessWindowFn => SideOutput-1
SideOutput-1 => SessionWindow1 => ProcessWindowFn =>
    (SideOutput-2[WindowRecords], KafkaSink[EvalResult])

STAGE-2
SideOutput-2 => Keyby => TumblingWindows2 => ProcessWindowFn => SideOutput-3
SideOutput-3 => SessionWindow2 => ProcessWindowFn =>
    (SideOutput-4[WindowRecords], KafkaSink[EvalResult])

STAGE-3
SideOutput-4 => Keyby => TumblingWindows3 => ProcessWindowFn => SideOutput-5
SideOutput-5 => SessionWindow3 => ProcessWindowFn => KafkaSink

DESCRIPTION

In STAGE-1, there are a fixed number of known keys so I will only see at
most about 21 distinct keys and therefore up to 21 tumbling one-minute
windows. I also need to aggregate all data in a global window to get
an overall non-keyed result. I need to bring the 21 results from those 21
tumbling windows AND the one global result into one place where I can
compare each of the 21 windows results to the one global result. Based on
this evaluation, only some of the 21 windows results will survive that
test. I want to then take the data records from those, say 3 surviving
windows, and make them the "source" for STAGE-2 processing as well as
publish some intermediate evaluation results to a KafkaSink. STAGE-2 will
reprocess the same data records that the three STAGE-1 surviving windows
processed, only keying them by different dimensions. I expect there to be
around 4000 fairly small records in each of the 21 STAGE-1 windows so, in
this example, I would be sending 4000 x 3 = 12000 records in SideOutput-2
to form the new "source" datastream for STAGE-2.
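
For concreteness, a hedged sketch of one way to bring the keyed and global
results together, assuming both are wrapped in a common hypothetical
EvalResult type so that union() can merge them ahead of the session window
(all names are placeholders):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // The 21 keyed window results and the single global result are both
    // emitted as EvalResult, so one stream carries everything the
    // comparison needs.
    DataStream<EvalResult> keyed = source
            .keyBy(r -> r.getKey())
            .window(TumblingEventTimeWindows.of(Time.minutes(1)))
            .process(new PerKeyEvalFn());

    DataStream<EvalResult> global = source
            .windowAll(TumblingEventTimeWindows.of(Time.minutes(1)))
            .process(new GlobalEvalFn());

    DataStream<EvalResult> merged = keyed.union(global);
    // merged then feeds SessionWindow1, where each key's result is
    // compared against the global result.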

Where I am struggling is:

   1. Trying to figure out how to best connect the output of the 21 STAGE-1
   windows and the one WindowAll window records into a single point (I propose
   SessionWindow1) to be able to compare each of the 21 windows data results
   with the WindowAll non-keyed results.
   2. The best way to connect together these multiple stages.

Looking at the STAGE-1 approach illustrated above, this is my attempt at an
approach using side outputs to:

   1. Form a new "source" data stream that contains the outputs of each of
   the 21 windows and the WindowAll data
   2. Consume that into a single session window
   3. Do the evaluations between the 21 keyed windows against the overall
   WindowAll data
   4. Then emit only the 3 surviving windows' data from the
   ProcessWindowFn to SideOutput-2, and the evaluation results to Kafka
   5. Finally, SideOutput-2 will form the new data stream "source" for
   STAGE-2, where a similar process repeats, passing data on to STAGE-3 for
   similar processing, to finally obtain the desired result that is
   published to Kafka (see the side-output sketch after this list).
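
A minimal sketch of the side-output mechanics for step 4, continuing the
merged stream from the previous sketch (Record, EvalResult and
CompareToGlobalFn are hypothetical; the OutputTag is declared as an
anonymous subclass so Flink can capture its element type):

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.OutputTag;

    // Tag for the surviving windows' raw records.
    final OutputTag<Record> sideOutput2 =
            new OutputTag<Record>("side-output-2") {};

    // Inside CompareToGlobalFn, surviving records go to the tag via
    // ctx.output(sideOutput2, record) while the evaluation result goes
    // to the main output via out.collect(evalResult).
    SingleOutputStreamOperator<EvalResult> stage1Eval = merged
            .windowAll(EventTimeSessionWindows.withGap(Time.seconds(30)))
            .process(new CompareToGlobalFn());

    // The side output becomes the "source" datastream for STAGE-2.
    DataStream<Record> stage2Source = stage1Eval.getSideOutput(sideOutput2);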

I would greatly appreciate the following:

   1. Comments on whether this is a valid approach - am I on the right track
   here?
   2. Could you suggest an alternate approach that I could investigate if
   this is problematic?

I am trying to build a Flink application that follows intended best
practices so I am just looking for some confirmation that I am heading down
a reasonable path for this design.

Thank you in advance,
Mark