Building Flink from source --- error resolving dependencies

2019-06-13 Thread syed
Hi; I am trying to build Apache Flink from source. I am strictly following the instructions given on the Flink GitHub page. https://github.com/apache/flink I am facing an error resolving dependencies for the project flink-mapr-fs. The details of the error are as below; [ERROR] Failed to

Flink end to end integration test

2019-06-13 Thread min.tan
Hi, I am new to Flink, at least to the testing part. We need an end to end integration test for a Flink job. Where can I find documentation for this? I am envisaging a test along these lines: 1) start a local job instance in an IDE or Maven test, 2) fire event JSONs to the data

Re: Flink end to end integration test

2019-06-13 Thread Konstantin Knauf
Hi Min, I recently published a small repository [1] containing examples of how to test Flink applications on different levels of the testing pyramid. It also contains one integration test, which spins up an embedded Flink cluster [2]. In contrast to your requirements, this test uses dedicated
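
A minimal sketch of such an embedded-cluster test, assuming JUnit 4 and the flink-test-utils dependency on the classpath (class names and the pipeline under test are illustrative, not taken from the repository [1]):

    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.runtime.testutils.MiniClusterResourceConfiguration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;
    import org.apache.flink.test.util.MiniClusterWithClientResource;
    import org.junit.ClassRule;
    import org.junit.Test;

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Collections;
    import java.util.List;

    import static org.junit.Assert.assertTrue;

    public class PipelineIntegrationTest {

        // Embedded Flink cluster, shared by all tests in this class.
        @ClassRule
        public static final MiniClusterWithClientResource FLINK_CLUSTER =
                new MiniClusterWithClientResource(
                        new MiniClusterResourceConfiguration.Builder()
                                .setNumberTaskManagers(1)
                                .setNumberSlotsPerTaskManager(2)
                                .build());

        @Test
        public void testIncrementPipeline() throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(2);

            CollectSink.VALUES.clear();
            env.fromElements(1L, 21L, 22L)
               .map(new IncrementMapFunction())
               .addSink(new CollectSink());
            env.execute();

            assertTrue(CollectSink.VALUES.containsAll(Arrays.asList(2L, 22L, 23L)));
        }

        private static class IncrementMapFunction implements MapFunction<Long, Long> {
            @Override
            public Long map(Long value) {
                return value + 1;
            }
        }

        // Test sink collecting results into a static list so the assertion
        // can inspect them after the job finishes.
        private static class CollectSink implements SinkFunction<Long> {
            static final List<Long> VALUES = Collections.synchronizedList(new ArrayList<>());

            @Override
            public void invoke(Long value, Context context) {
                VALUES.add(value);
            }
        }
    }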

java.lang.NoClassDefFoundError --- FlinkKafkaConsumer

2019-06-13 Thread syed
Hi, I am trying to add a Kafka source to the standard WordCount application that ships with Flink, so that the application may count words from the Kafka source. I added the source like this: DataStream<String> text; if (params.has("topic") && params.has("bootstrap.servers") && params.has("zookeeper.connect") && params.has("group.id")) {

Re: java.lang.NoClassDefFoundError --- FlinkKafkaConsumer

2019-06-13 Thread Yun Tang
Hi Syed Have you built your user application jar with all its dependencies, as [1] suggests, or copied the related connector jar from the flink-home/opt/ folder to the flink-home/lib/ folder? [1]
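
A sketch of wiring the consumer once the connector is actually on the runtime classpath; topic, servers, and group id are placeholders:

    import java.util.Properties;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class KafkaWordCountSource {
        public static DataStream<String> kafkaLines(StreamExecutionEnvironment env) {
            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
            props.setProperty("group.id", "wordcount");               // placeholder

            // A NoClassDefFoundError at runtime usually means the connector class
            // was on the compile classpath but not shipped: either shade it into
            // the fat jar or copy the connector jar into flink-home/lib/.
            return env.addSource(
                    new FlinkKafkaConsumer<>("wordcount-topic", new SimpleStringSchema(), props));
        }
    }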

Re: Latency Monitoring in Flink application

2019-06-13 Thread Konstantin Knauf
Hi Roey, with latency tracking you will get a distribution of the time it took for LatencyMarkers to travel from each source operator to each downstream operator (by default one histogram per source operator in each non-source operator, see metrics.latency.granularity). LatencyMarkers are
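
A sketch of enabling it per job; the 1000 ms interval is an arbitrary example (it can also be set cluster-wide via metrics.latency.interval in flink-conf.yaml):

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LatencyTrackingSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Emit a LatencyMarker from each source every 1000 ms; downstream
            // operators then expose latency histograms through the metrics system.
            env.getConfig().setLatencyTrackingInterval(1000L);

            env.fromElements("a", "b", "c").print();
            env.execute("latency-tracking-example");
        }
    }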

Re: Latency Monitoring in Flink application

2019-06-13 Thread Timothy Victor
Thanks for the insight. I was also interested in this topic. One thought that occurred to me: what about the queuing delay when sending to your message bus (e.g. Kafka)? I am guessing the probe is taken before the message is added to the send queue? Thanks again Tim On Thu, Jun 13, 2019, 6:08

Latency Monitoring in Flink application

2019-06-13 Thread Halfon, Roey
Hi All, I'm looking for help regarding latency monitoring. Let's say I have a simple streaming data flow with the following operators: FlinkKafkaConsumer -> Map -> print. If I want to measure the latency of record processing in my dataflow, what would be the best approach? I want to get
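
One alternative to the built-in latency markers discussed in the replies above is measuring the lag yourself. A sketch of a map operator exposing event-time lag as a gauge, assuming records are Tuple2 values whose f0 field carries the event timestamp (metric name and types are illustrative):

    import org.apache.flink.api.common.functions.RichMapFunction;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.metrics.Gauge;

    public class EventTimeLagMap extends RichMapFunction<Tuple2<Long, String>, Tuple2<Long, String>> {

        private transient long lastLagMillis;

        @Override
        public void open(Configuration parameters) {
            // Expose the most recent lag observation to the metrics system.
            getRuntimeContext().getMetricGroup()
                    .gauge("eventTimeLagMillis", (Gauge<Long>) () -> lastLagMillis);
        }

        @Override
        public Tuple2<Long, String> map(Tuple2<Long, String> event) {
            // Wall clock minus the record's event time approximates end-to-end
            // lag up to this operator (ignoring producer/Flink clock skew).
            lastLagMillis = System.currentTimeMillis() - event.f0;
            return event;
        }
    }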

Re: What happens when: high-availability.storageDir: is not available?

2019-06-13 Thread John Smith
Thanks. Does this folder need to be available for task nodes as well? Or just job nodes? On Wed., Jun. 12, 2019, 11:56 p.m. Yun Tang, wrote: > The high availability storage directory would store the > submitted job graph and completed checkpoints. If this directory is > unavailable

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Felipe Gutierrez
Hi Hequn, indeed the ReduceFunction is better than the ProcessWindowFunction. I replaced it and could verify the performance improvement [1]. Thanks for that! I will try a distinct count with the Table API. The issue I am facing now is that I want to use a HyperLogLog in a UDF for the DataStream API.

Re: Loading state from sink/understanding recovery options

2019-06-13 Thread Eduardo Winpenny Tejedor
Could someone please comment on this question, in either direction? Thanks, Eduardo On Sat, 8 Jun 2019, 14:10 Eduardo Winpenny Tejedor, < eduardo.winpe...@gmail.com> wrote: > Hi, > > In generic terms, if a keyed operator outputs its state into a sink, and > the state of that

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Felipe Gutierrez
Hmm, it seems that it is my turn to implement all this using the Table API. Thanks Rong! -- Felipe Gutierrez -- skype: felipe.o.gutierrez -- https://felipeogutierrez.blogspot.com On Thu, Jun 13, 2019 at 6:00 PM Rong Rong wrote: > Hi

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Rong Rong
Hi Felipe, Hequn is right. The problem you are facing is better solved with Table API level code instead of the DataStream API; you will have more Flink library support to achieve your goal. In addition, the Flink Table API also supports a user-defined AggregateFunction [1] to achieve your HyperLogLog
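
A minimal sketch of such a user-defined aggregate for the Table API; the accumulator here is a plain HashSet for an exact count, and the assumption is that a real HyperLogLog sketch (from a sketching library) would slot into the same skeleton:

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.flink.table.functions.AggregateFunction;

    // Exact distinct count as a stand-in; a HyperLogLog variant would replace
    // the Set accumulator with a sketch and return its cardinality estimate.
    public class CountDistinct extends AggregateFunction<Long, CountDistinct.SetAcc> {

        public static class SetAcc {
            public Set<String> seen = new HashSet<>();
        }

        @Override
        public SetAcc createAccumulator() {
            return new SetAcc();
        }

        @Override
        public Long getValue(SetAcc acc) {
            return (long) acc.seen.size();
        }

        // Invoked by the runtime for each input row (by naming convention).
        public void accumulate(SetAcc acc, String value) {
            if (value != null) {
                acc.seen.add(value);
            }
        }
    }

Once registered with the TableEnvironment (e.g. tableEnv.registerFunction("countDistinct", new CountDistinct())), it can be used like a built-in aggregate.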

Re: Flink end to end intergration test

2019-06-13 Thread Theo Diefenthal
Hi Min, In order to run "clean" integration tests with Kafka, I set up a JUnit rule for building up Kafka (as mentioned by Konstantin), but I also use my own Kafka deserializer (by extending my project's custom deserializer) like so: public class TestDeserializer extends
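
A sketch of what such a test deserializer can look like, assuming the production deserializer implements Flink's KafkaDeserializationSchema (the end-of-stream trick bounds an otherwise unbounded test job; names are illustrative):

    import java.nio.charset.StandardCharsets;

    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
    import org.apache.kafka.clients.consumer.ConsumerRecord;

    // Test-only variant: behaves like the production deserializer but reports
    // end-of-stream after a fixed number of records so the test job terminates.
    public class TestDeserializer implements KafkaDeserializationSchema<String> {

        private final int expectedRecords;
        private int seenRecords;

        public TestDeserializer(int expectedRecords) {
            this.expectedRecords = expectedRecords;
        }

        @Override
        public String deserialize(ConsumerRecord<byte[], byte[]> record) {
            seenRecords++;
            return new String(record.value(), StandardCharsets.UTF_8);
        }

        @Override
        public boolean isEndOfStream(String nextElement) {
            return seenRecords >= expectedRecords;
        }

        @Override
        public TypeInformation<String> getProducedType() {
            return TypeInformation.of(String.class);
        }
    }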

Re: Deletion of topic from a pattern in KafkaConsumer if auto-create is true will always recreate the topic

2019-06-13 Thread Vishal Santoshi
I guess adjusting the pattern (blacklisting the topic/s) would work. On Thu, Jun 13, 2019 at 3:02 PM Vishal Santoshi wrote: > Given
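
A sketch of that pattern adjustment, using a negative lookahead to blacklist the deleted topic (topic names and properties are placeholders):

    import java.util.Properties;
    import java.util.regex.Pattern;

    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

    public class BlacklistedTopicPattern {
        public static FlinkKafkaConsumer<String> consumer(Properties kafkaProps) {
            // Match every "events-*" topic except the deleted one, so the
            // pattern-based consumer's metadata fetches cannot auto-recreate it.
            Pattern pattern = Pattern.compile("events-(?!deleted-topic$).*");
            return new FlinkKafkaConsumer<>(pattern, new SimpleStringSchema(), kafkaProps);
        }
    }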

Deletion of topic from a pattern in KafkaConsumer if auto-create is true will always recreate the topic

2019-06-13 Thread Vishal Santoshi
Given https://github.com/apache/flink/blob/master/flink-connectors/flink-connector-kafka-base/src/main/java/org/apache/flink/streaming/connectors/kafka/FlinkKafkaConsumerBase.java#L471 it seems that if (a) I have a regex pattern for consuming a bunch of topics and (b) auto create is turned on, then

Re: What happens when: high-availability.storageDir: is not available?

2019-06-13 Thread Yun Tang
Besides the job graph and completed checkpoints, the high availability storage directory also stores blob data, which is accessed from both jobmanager and taskmanager nodes; you can refer to [1] for the BLOB storage architecture. [1]

Re: Source code question - about the logic of calculating network buffer

2019-06-13 Thread 徐涛
Hi Yun, Thanks a lot for the detailed and clear explanation, that is very helpful. Best Henry > On Jun 13, 2019, at 10:32 AM, Yun Gao wrote: > > Hi Tao, > > As a whole, `networkBufBytes` is not part of the heap. In fact, it is > allocated from the direct memory. The rough relationship
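
For readers without the quoted thread, the gist of that explanation as a rough sketch (variable names follow the Flink source only loosely; the point is the fraction-then-clamp logic over direct, not heap, memory):

    // The network buffer pool is sized as a fraction of total memory, clamped
    // between a configured min and max, and allocated from direct (off-heap)
    // memory rather than the JVM heap.
    public final class NetworkBufferSketch {
        static long networkBufBytes(long totalMemoryBytes,
                                    float networkBufFraction,
                                    long networkBufMin,
                                    long networkBufMax) {
            long fromFraction = (long) (totalMemoryBytes * networkBufFraction);
            return Math.min(networkBufMax, Math.max(networkBufMin, fromFraction));
        }
    }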

Re: How can I improve this Flink application for "Distinct Count of elements" in the data stream?

2019-06-13 Thread Hequn Cheng
Hi Felipe, From your code, I think you want to get the "count distinct" result instead of the "distinct count"; they carry different meanings. To improve performance, you can replace your DistinctProcessWindowFunction with a DistinctProcessReduceFunction. A ReduceFunction can aggregate the
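
A sketch of the incremental idea, written here with the DataStream AggregateFunction (a close cousin of combining a ReduceFunction with a window function; types and the window choice are illustrative):

    import java.util.HashSet;
    import java.util.Set;

    import org.apache.flink.api.common.functions.AggregateFunction;

    // Folds each element into the accumulator as it arrives, instead of
    // buffering the whole window for a ProcessWindowFunction; only the
    // accumulator is kept in window state.
    public class DistinctCountAggregate implements AggregateFunction<String, Set<String>, Long> {

        @Override
        public Set<String> createAccumulator() {
            return new HashSet<>();
        }

        @Override
        public Set<String> add(String value, Set<String> acc) {
            acc.add(value);
            return acc;
        }

        @Override
        public Long getResult(Set<String> acc) {
            return (long) acc.size();
        }

        @Override
        public Set<String> merge(Set<String> a, Set<String> b) {
            a.addAll(b);
            return a;
        }
    }

Applied as input.keyBy(...).window(...).aggregate(new DistinctCountAggregate()).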

Re: Loading state from sink/understanding recovery options

2019-06-13 Thread Congxian Qiu
Hi, Eduardo Currently, we can't load state from the outside (there is an ongoing JIRA [1] to do this); in other words, if you disable checkpointing and use Kafka/a database as your state storage, you have to handle deduplication yourself. Just curious, which state backend do you use,

Re: flink

2019-06-13 Thread 王志明
Hi: If these delayed records arrive only after the window has been evaluated, Flink will drop them, but you can use allowedLateness and Side Outputs to capture the delayed data and then process it separately. On 2019-05-13 15:35:33, "Kobeli" wrote: > Hello, a question about Flink watermarks. > A Flink job aggregates metrics on event time; the source is multi-partition Kafka. If the data of one partition is delayed for some reason (e.g. under-replication), > Flink
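
A sketch of the allowedLateness plus side-output pattern mentioned above; window size, lateness bound, and record types are placeholders:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.OutputTag;

    public class LateDataRouting {
        public static void wire(DataStream<Tuple2<String, Long>> events) {
            // Anonymous subclass keeps the generic type information at runtime.
            final OutputTag<Tuple2<String, Long>> lateTag =
                    new OutputTag<Tuple2<String, Long>>("late-events") {};

            SingleOutputStreamOperator<Tuple2<String, Long>> aggregated = events
                    .keyBy(0)
                    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                    .allowedLateness(Time.minutes(5))   // still update for records this late
                    .sideOutputLateData(lateTag)        // anything later goes here
                    .sum(1);

            // Records too late even for allowedLateness are no longer silently
            // dropped; they can be logged or re-processed from the side output.
            DataStream<Tuple2<String, Long>> lateEvents = aggregated.getSideOutput(lateTag);
            lateEvents.print();
        }
    }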