How to customize (or use) a window so that each day's beginning serves as the watermark?

2019-11-25 Thread Rock
I need my job to aggregate every device's metrics into a daily report, but I did not find a window that can cover exactly one day, or a way to let each day's beginning act as the watermark. Should I write a custom window, or is there another way to achieve this?
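For reference, Flink's built-in tumbling event-time windows can already cover exactly one day, and an offset can shift the day boundary to a local midnight. The window-start arithmetic they use can be sketched with the stdlib alone; the class and method names below are illustrative, not Flink's, and the formula is a floor-mod restatement of the one used by tumbling windows:

```java
public class DailyWindowStart {
    // Mirrors the window-start arithmetic of tumbling windows:
    // start = ts - ((ts - offset) mod size), with a floor-mod so
    // timestamps before the epoch still land in the right window.
    static long windowStart(long timestampMs, long offsetMs, long windowSizeMs) {
        return timestampMs - Math.floorMod(timestampMs - offsetMs, windowSizeMs);
    }

    public static void main(String[] args) {
        long day = 24 * 60 * 60 * 1000L;
        long ts = 1574677800000L;            // 2019-11-25T10:30:00Z
        // offset 0 -> window starts at UTC midnight
        System.out.println(windowStart(ts, 0L, day));
        // offset -8h -> window starts at midnight in UTC+8
        System.out.println(windowStart(ts, -8L * 3600_000L, day));
    }
}
```

With a one-day size and a timezone offset, every event maps to the window that begins at its local midnight, which is usually what a daily report needs.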

Kerberos authentication issue with Flink on YARN

2019-11-25 Thread venn
Hi all, a question about Flink authentication: Flink on YARN runs on a Hadoop cluster without authentication; how can it access HBase in a Kerberos-authenticated cluster? Below is a description of our setup and the problems we found: we have two Hadoop clusters, one using Kerberos authentication and one using simple authentication, and Flink 1.9.0 is deployed on the simple-authentication cluster. Recently we ran into problems when using Flink to read HBase from the Kerberos-authenticated cluster. Configuring flink-conf.yaml
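For context, the Kerberos-related keys in flink-conf.yaml look like the fragment below (the keytab path and principal are placeholders); whether these credentials can be picked up when the YARN cluster itself runs with simple authentication is exactly the open question of this thread:

```yaml
security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/user.keytab
security.kerberos.login.principal: user@EXAMPLE.COM
# Which login contexts receive the credentials (e.g. the Hadoop/HBase client):
security.kerberos.login.contexts: Client
```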

Re: Flink behavior as a slow consumer - out of Heap MEM

2019-11-25 Thread vino yang
Hi Hanan, Sometimes the behavior depends on your implementation. Since it's not a built-in connector, it would be better to share your customized source with the community, so that the community can better help you figure out where the problem is. WDYT? Best, Vino Hanan Yehudai

Flink behavior as a slow consumer - out of Heap MEM

2019-11-25 Thread Hanan Yehudai
Hi, I am trying to run some performance tests against my Flink deployment, implementing an extremely simplistic use case: I built a ZMQ source. The topology is ZMQ Source -> Mapper -> DiscardingSink (a sink that does nothing). Data is pushed via ZMQ at a very high rate. When the incoming rate

Fwd: How to recover state from savepoint on embedded mode?

2019-11-25 Thread Reo Lei
-- Forwarded message - From: Reo Lei Date: Tue, Nov 26, 2019, 9:53 AM Subject: Re: How to recover state from savepoint on embedded mode? To: Yun Tang Hi Yun, Thanks for your reply. What I mean by embedded mode is the whole Flink cluster and job, including the jobmanager, taskmanager and the

Re: Apache Flink - Troubleshooting exception while starting the job from savepoint

2019-11-25 Thread M Singh
Hi Kostas/Congxian: Thanks for your responses. Based on your feedback, I found that I had missed adding a uid to one of the stateful operators, and correcting that resolved the issue. I still have stateless operators for which no uid is specified in the application. So, I thought that adding uid

Re: Question about to modify operator state on Flink1.7 with State Processor API

2019-11-25 Thread vino yang
Hi Kaihao, Ping @Aljoscha Krettek @Tzu-Li (Gordon) Tai to give more professional suggestions. What's more, we may need to give a statement about whether the State Processor API can process the snapshots generated by jobs on older versions. WDYT? Best, Vino Kaihao Zhao wrote on Mon, Nov 25, 2019 at 11:39 PM: >

Streaming Files to S3

2019-11-25 Thread Li Peng
Hey folks, I'm trying to stream large volume data and write them as csv files to S3, and one of the restrictions is to try and keep the files to below 100MB (compressed) and write one file per minute. I wanted to verify with you guys regarding my understanding of StreamingFileSink: 1. From the

Re: Pre-process data before it hits the Source

2019-11-25 Thread vino yang
Hi Vijay, IMO, the semantics of a source are not fixed. It can integrate with third-party systems and consume events, but it can also contain more business logic to pre-process your data after consuming events. Maybe it needs some customization. WDYT? Best, Vino Vijay

Re: Dynamically creating new Task Managers in YARN

2019-11-25 Thread Piper Piper
Hi Yang, Session mode is working exactly as you described. No exceptions. Thank you! Piper On Sun, Nov 24, 2019 at 11:24 PM Yang Wang wrote: > Hi Piper, > > In session mode, Flink will always use the free slots in the existing > TaskManagers first. > When it can not full fill the slot

Flink 1.9.1 KafkaConnector missing data (1M+ records)

2019-11-25 Thread Harrison Xu
Hello, We're seeing some strange behavior with flink's KafkaConnector010 (Kafka 0.10.1.1) arbitrarily skipping data. *Context* KafkaConnector010 is used as source, and StreamingFileSink/BulkPartWriter (S3) as sink with no intermediate operators. Recently, we noticed that millions of Kafka

Pre-process data before it hits the Source

2019-11-25 Thread Vijay Balakrishnan
Hi, I need to pre-process data (transform incoming data to a different format) before it hits the Source I have defined. How can I do that? I tried to use a .map on the DataStream, but that is too late, as the data has already hit the Source I defined. FlinkKinesisConsumer> kinesisConsumer =

Re: Metrics for Task States

2019-11-25 Thread Kelly Smith
Thanks Caizhi, that was what I was afraid of. Thanks for the information on the REST API. It seems like the right solution would be to add it as a first-class feature for Flink, so I will add a feature request. I may end up using the REST API as a workaround in the short term - probably with a

Re: Completed job wasn't saved to archive

2019-11-25 Thread Chesnay Schepler
I'm afraid I can't think of a solution. I don't see a way how this operation can succeed or fail without anything being logged. Is the cluster behaving normally afterwards? Could you check whether the numRunningJobs ticks down properly after the job was canceled? On 22/11/2019 13:27, Pavel

Re: How to recover state from savepoint on embedded mode?

2019-11-25 Thread Yun Tang
What does the embedded mode mean here? If you refer to the SQL embedded mode, you cannot resume from a savepoint now; if you refer to a local standalone cluster, you could use `bin/flink run -s` to resume on a local cluster. Best Yun Tang From: Reo Lei Date: Tuesday, November 26, 2019 at 12:37 AM To:

How to recover state from savepoint on embedded mode?

2019-11-25 Thread Reo Lei
Hi, I have a job that needs to run on embedded mode, but it needs to init some rule data from a database before starting. So I used the State Processor API to construct my state data and save it to the local disk. When I wanted to use this savepoint to recover my job, I found that resuming a job from a savepoint needs

Re: How to submit two jobs sequentially and view their outputs in .out file?

2019-11-25 Thread Piotr Nowojski
Hi, I would suggest the same thing as Vino did: it might be possible to use stdout somehow, but it’s a better idea to coordinate in some other way. Produce some (side?) output with a control message from one job once it finishes, that will control the second job. Piotrek > On 25 Nov 2019, at

Question about to modify operator state on Flink1.7 with State Processor API

2019-11-25 Thread Kaihao Zhao
Hi, We are running Flink 1.7, and recently, due to a Kafka cluster migration, we need to find a way to modify the Kafka offsets in FlinkKafkaConnector's state. We found that Flink 1.9's State Processor API is exactly the tool we need: we are able to modify the operator state via the State Processor API, but

Re: Apache Flink - Troubleshooting exception while starting the job from savepoint

2019-11-25 Thread Congxian Qiu
Hi, The problem is that the specified uid does not exist in the new job. 1. As far as I know, the answer is yes. There are some operators that have their own state (such as window state); could you please share the minimal code of your job? 2. *Truly* stateless operators do not need to have a uid, but for the

Re: Window-join DataStream (or KeyedStream) with a broadcast stream

2019-11-25 Thread Piotr Nowojski
Hi, So you are trying to use the same window definition, but you want to aggregate the data in two different ways: 1. keyBy(userId) 2. Global aggregation Do you want to use exactly the same aggregation functions? If not, you can just process the events twice: DataStream<…> events = …;

Re: Flink distributed runtime architucture

2019-11-25 Thread Piotr Nowojski
Hi, I’m glad to hear that you are interested in Flink! :) > In the picture, keyBy window and apply operators share the same circle. Is > it because these operators are chained together? It’s not as much about chaining, as the chain of DataStream API invocations

Re: Apache Flink - Troubleshooting exception while starting the job from savepoint

2019-11-25 Thread Kostas Kloudas
As a side note, I am assuming that you are using the same Flink Job before and after the savepoint and the same Flink version. Am I correct? Cheers, Kostas On Mon, Nov 25, 2019 at 2:40 PM Kostas Kloudas wrote: > > Hi Singh, > > This behaviour is strange. > One thing I can recommend to see if

Re: Apache Flink - Troubleshooting exception while starting the job from savepoint

2019-11-25 Thread Kostas Kloudas
Hi Singh, This behaviour is strange. One thing I can recommend to see if the two jobs are identical is to launch also the second job without a savepoint, just start from scratch, and simply look at the web interface to see if everything is there. Also could you please provide some code from your

Re: Per Operator State Monitoring

2019-11-25 Thread Piotr Nowojski
Hi, I’m not sure if there is some simple way of doing that (maybe some other contributors will know more). There are two potential ideas worth exploring: - use periodically triggered savepoints for monitoring? If I remember correctly, savepoints are never incremental - use savepoint

Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

2019-11-25 Thread Piotr Nowojski
Hi, Good to hear that you were able to workaround the problem. I’m not sure what’s the exact reason why mmaped partitions caused those failures, but you are probably right that they have caused some memory exhaustion. Probably this memory is not capped by anything, but I would expect kernel

Re: DataSet API: HBase ScannerTimeoutException and double Result processing

2019-11-25 Thread OpenInx
> If the call to mapResultToOutType(Result) finished without an error there is no need to restart from the same row. > The new scanner should start from the next row. > Is that so or am I missing something? Yeah, you are right. I've filed the issue

Re: Apache Flink - Uid and name for Flink sources and sinks

2019-11-25 Thread M Singh
Thanks Dian for your pointers. Mans. On Sunday, November 24, 2019, 08:57:53 PM EST, Dian Fu wrote: Hi Mans, Please see my reply inline below. On Nov 25, 2019, at 5:42 AM, M Singh wrote: Thanks Dian for your answers. A few more questions: 1. If I do not assign uids to operators/sources and

Re: Apache Flink - Throttling stream flow

2019-11-25 Thread M Singh
Thanks Caizhi & Thomas for your responses. I read the throttling example, but I want to see if that works with a distributed broker like Kinesis, and how to feed throttling feedback back to the Kinesis source so that it can vary the rate without interfering with watermarks, etc. Thanks again, Mans

Apache Flink - High Available K8 Setup Wrongly Marked Dead TaskManager

2019-11-25 Thread Eray Arslan
Hi, I have some trouble with my HA K8 cluster. Currently my Flink application runs an infinite stream (with 12 parallelism). After a few days I am losing my task managers, and they never reconnect to the job manager. Because of this, the application cannot get restored by the restart policy. I did a few searches

Re: flink session cluster ha on k8s

2019-11-25 Thread Yun Tang
Currently, you still need zookeeper service to enable HA on k8s, and the configuration for this part is no different from YARN mode [1]. By the way, there also exists other solution to implement HA like etcd [2], but still in discussion. [1]

Re: DataSet API: HBase ScannerTimeoutException and double Result processing

2019-11-25 Thread Mark Davis
Hi Flavio, >> When the resultScanner dies because of a timeout (this happens a lot when >> you have backpressure and the time between 2 consecutive reads exceed the >> scanner timeout), the code creates a new scanner and restart from where it >> was (starRow = currentRow). >> So there should

Re: Apache Flink - Throttling stream flow

2019-11-25 Thread Thomas Julian
related https://issues.apache.org/jira/browse/FLINK-13792 Regards, Julian. On Mon, 25 Nov 2019 15:25:14 +0530 Caizhi Weng wrote Hi, As far as I know, Flink currently doesn't have a built-in throttling function. You can write your own user-defined function to

Re: Flink Kudu Connector

2019-11-25 Thread vino yang
Hi Rahul, I only found some resources on the Internet that you can consider. [1][2] Best, Vino [1]: https://bahir.apache.org/docs/flink/current/flink-streaming-kudu/ [2]: https://www.slideshare.net/0xnacho/apache-flink-kudu-a-connector-to-develop-kappa-architectures Rahul Jain wrote on Mon, Nov 25, 2019

Flink Kudu Connector

2019-11-25 Thread Rahul Jain
Hi, We are trying to use the Flink Kudu connector. Is there any documentation available that we can read to understand how to use it ? We found some sample code but that was not very helpful. Thanks, -rahul

Re: Apache Flink - Throttling stream flow

2019-11-25 Thread Caizhi Weng
Hi, As far as I know, Flink currently doesn't have a built-in throttling function. You can write your own user-defined function to achieve this. Your function just gives out what it reads in and limits the speed it gives out records at the same time. If you're not familiar with user-defined
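The user-defined throttling function described above boils down to a pass-through that sleeps whenever it is ahead of the target rate. A minimal stdlib-only sketch of that rate-limiting logic follows; in a Flink job this body would typically live inside a map function, and the `Throttler` class here is illustrative, not part of any Flink API:

```java
public class Throttler {
    private final long periodNanos;   // minimum spacing between records
    private long nextFreeNanos = 0;   // earliest time the next record may go out

    public Throttler(double recordsPerSecond) {
        this.periodNanos = (long) (1_000_000_000L / recordsPerSecond);
    }

    // Blocks until emitting another record keeps us under the target rate,
    // then returns the input unchanged (a pass-through, like map(x -> x)).
    public <T> T throttle(T record) throws InterruptedException {
        long now = System.nanoTime();
        if (now < nextFreeNanos) {
            long waitNanos = nextFreeNanos - now;
            Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
            now = nextFreeNanos;
        }
        nextFreeNanos = now + periodNanos;
        return record;
    }

    public static void main(String[] args) throws InterruptedException {
        Throttler t = new Throttler(100.0);   // at most ~100 records/sec
        long start = System.nanoTime();
        for (int i = 0; i < 20; i++) t.throttle(i);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000L;
        // 20 records at 100/sec take at least ~190 ms
        System.out.println(elapsedMs >= 180);
    }
}
```

Note that sleeping inside an operator also creates backpressure upstream, which is exactly the throttling effect the thread is after; it does, however, delay checkpoint barriers flowing through the same channel.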

Re: Idiomatic way to split pipeline

2019-11-25 Thread Avi Levi
Thanks, I'll check it out. On Mon, Nov 25, 2019 at 11:46 AM vino yang wrote: > Hi Avi, > > The side output provides a superset of split's functionality. So anything > can be implemented via split also can be

Re: Idiomatic way to split pipeline

2019-11-25 Thread vino yang
Hi Avi, The side output provides a superset of split's functionality. So anything that can be implemented via split can also be implemented via side output. [1] Best, Vino [1]: https://stackoverflow.com/questions/51440677/apache-flink-whats-the-difference-between-side-outputs-and-split-in-the-data

Re: Metrics for Task States

2019-11-25 Thread Caizhi Weng
Hi Kelly, As far as I know, Flink currently does not have such metrics to monitor the number of tasks in each state. See https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html for the complete metrics list. (It seems that `taskSlotsAvailable` in the metrics list is the

Re: DataSet API: HBase ScannerTimeoutException and double Result processing

2019-11-25 Thread Flavio Pompermaier
Maybe the problem is indeed this: the fact that the scan restarts from the last seen row. In this case the first result should probably be skipped, because it was already read. On Mon, Nov 25, 2019 at 10:22 AM Flavio Pompermaier wrote: > What I can tell is how the HBase input format works..if you

Re: Idiomatic way to split pipeline

2019-11-25 Thread Avi Levi
Thank you for your quick reply, I appreciate that. But this is not exactly a "side output" per se; it is simple splitting. IIUC, the side output is more for splitting the records by something that differentiates them (lateness, value, etc.). I thought there was something more idiomatic, but if this is it, then

Re: DataSet API: HBase ScannerTimeoutException and double Result processing

2019-11-25 Thread Flavio Pompermaier
What I can tell is how the HBase input format works: if you look at AbstractTableInputFormat [1], this is the nextRecord() function: public T nextRecord(T reuse) throws IOException { if (resultScanner == null) { throw new IOException("No table result
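The double-processing discussed in this thread can be sketched independently of HBase: when a timed-out scan is restarted at the last seen row (rather than the row after it), the first result of the new scanner is a duplicate and must be skipped. A minimal stdlib sketch of that restart logic, using a sorted list iterator as a stand-in for the scanner (all names here are illustrative, not the real input format's):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ResumableScan {
    // Simulates opening a scanner at `startRow` over sorted row keys:
    // a real scanner opened at startRow returns startRow itself first.
    static Iterator<String> openScanner(List<String> rows, String startRow) {
        List<String> view = new ArrayList<>();
        for (String r : rows) if (r.compareTo(startRow) >= 0) view.add(r);
        return view.iterator();
    }

    // Reads everything, simulating one scanner "timeout" after `failAfter`
    // rows. On restart we reopen at the last seen row and drop the first
    // result, because it was already emitted before the failure.
    static List<String> readAll(List<String> rows, int failAfter) {
        List<String> out = new ArrayList<>();
        Iterator<String> scanner = openScanner(rows, "");
        String lastSeen = null;
        boolean failed = false;
        while (scanner.hasNext()) {
            if (!failed && lastSeen != null && out.size() == failAfter) {
                failed = true;
                scanner = openScanner(rows, lastSeen); // restart at last seen row
                scanner.next();                        // skip the duplicate
                continue;
            }
            lastSeen = scanner.next();
            out.add(lastSeen);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("r1", "r2", "r3", "r4", "r5");
        // prints [r1, r2, r3, r4, r5] - no row is emitted twice
        System.out.println(readAll(rows, 2));
    }
}
```

Without the `scanner.next()` skip after the restart, `lastSeen` would be emitted a second time, which is the duplicate-Result behavior reported earlier in the thread.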

Re: Idiomatic way to split pipeline

2019-11-25 Thread vino yang
Hi Avi, As the doc of DataStream#split says, you can use the "side output" feature to replace it. [1] [1]: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html Best, Vino Avi Levi wrote on Mon, Nov 25, 2019 at 4:12 PM: > Hi, > I want to split the output of one of the operators

Re: How to submit two jobs sequentially and view their outputs in .out file?

2019-11-25 Thread vino yang
Hi Komal, > Thank you! That's exactly what's happening. Is there any way to force it write to a specific .out of a TaskManager? No, I am curious why the two jobs depend on stdout? Can we introduce another coordinator other than stdout? IMO, this mechanism is not always available. Best, Vino

Idiomatic way to split pipeline

2019-11-25 Thread Avi Levi
Hi, I want to split the output of one of the operators into two pipelines. Since the *split* method is deprecated, what is the idiomatic way to do that without duplicating the operator? [image: Screen Shot 2019-11-25 at 10.05.38.png]