Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Hello Fabian, I created the JIRA bug https://issues.apache.org/jira/browse/FLINK-9940. BTW, I have one more question: is it worth checkpointing that list of processed files? Does the current implementation of the file source guarantee exactly-once? Thanks for your support.

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Thank you Fabian. I implemented a quick test based on what you suggested, using an offset from the system time, and I did get an improvement: with offset = 500ms the problem is completely gone. With offset = 50ms, I still got around 3-5 files missed out of 10,000. This number might come

Best way to find the current alive jobmanager with HA mode zookeeper

2018-07-24 Thread Yuan,Youjun
Hi all, I have a standalone cluster with 3 jobmanagers and high-availability set to zookeeper. Our client submits jobs via the REST API (POST /jars/:jarid/run), which means we need to know the host of any of the currently alive jobmanagers. The problem is: how can we know which job manager is

Re: downgrade Flink

2018-07-24 Thread vino yang
Hi Cederic, I just looked at the project you mentioned; its README file includes the following statement: "flink-jpmml is tested with the latest Flink (i.e. 1.3.2), but any working Apache Flink version (repo) should work properly." This project was born a year ago and should not rely on

Re: Flink 1.5 batch job fails to start

2018-07-24 Thread vino yang
Hi Alex, Is it possible that the data has been corrupted? Or have you confirmed that the Avro version is consistent across the different Flink versions? Also, if you don't upgrade Flink and keep using version 1.3.1, can it be recovered? Thanks, vino. 2018-07-25 8:32 GMT+08:00 Alex Vinnik : > Vino, >

Re: Flink 1.5 batch job fails to start

2018-07-24 Thread Alex Vinnik
Vino, I upgraded Flink to the Hadoop 2.8.1 build: $ docker exec -it flink-jobmanager cat /var/log/flink/flink.log | grep entrypoint | grep 'Hadoop version' 2018-07-25T00:19:46.142+ [org.apache.flink.runtime.entrypoint.ClusterEntrypoint] INFO Hadoop version: 2.8.1 but the job still fails to start. Ideas?

downgrade Flink

2018-07-24 Thread Cederic Bosmans
Dear all, I am working on a streaming prediction model for which I want to try the flink-jpmml extension (https://github.com/FlinkML/flink-jpmml). Unfortunately, it supports only the 0.7.0-SNAPSHOT and 0.6.1 versions of Flink, and I am using the 1.7-SNAPSHOT version of Flink. How can I

Re: Recommended fat jar excludes?

2018-07-24 Thread Chesnay Schepler
The previous list excluded a number of dependencies to prevent clashes with Flink (for example netty), which is no longer required. If you could provide the output of "mvn dependency:tree", we might be able to figure out why the jar is larger. On 24.07.2018 20:49, jlist9 wrote: We started out

Recommended fat jar excludes?

2018-07-24 Thread jlist9
We started out with a sample project from an earlier version of flink-java. The sample project's pom.xml contained a long list of exclusion elements for building the fat jar. The fat jar size is slightly over 100MB in our case. We are looking to upgrade to Flink 1.5, so we updated the pom.xml using one

Avro writer has already been opened

2018-07-24 Thread Chengzhi Zhao
Hi there, I am using the Avro format and writing data to S3. I recently upgraded from Flink 1.3.2 to 1.5 and noticed the errors below. I am using RocksDB, and checkpointDataUri is an S3 location. My writer looks something like the below. val writer = new AvroKeyValueSinkWriter[String,
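
What such a setup typically looks like in Flink 1.5 is sketched below; the schemas, value types, and S3 path are placeholders rather than details from the original message, and the sketch assumes the flink-connector-filesystem and Avro dependencies are on the classpath.

    import java.util.{HashMap => JHashMap}

    import org.apache.avro.Schema
    import org.apache.flink.api.java.tuple.{Tuple2 => JTuple2}
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.connectors.fs.AvroKeyValueSinkWriter
    import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink

    object AvroS3SinkSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Writer properties: Avro schemas for the key and the value (both String here as placeholders).
        val props = new JHashMap[String, String]()
        props.put(AvroKeyValueSinkWriter.CONF_OUTPUT_KEY_SCHEMA, Schema.create(Schema.Type.STRING).toString)
        props.put(AvroKeyValueSinkWriter.CONF_OUTPUT_VALUE_SCHEMA, Schema.create(Schema.Type.STRING).toString)

        // The writer consumes Flink Java Tuple2<K, V> elements.
        val records: DataStream[JTuple2[String, String]] =
          env.fromElements(new JTuple2("key-1", "value-1"), new JTuple2("key-2", "value-2"))

        // "s3://my-bucket/output" is a placeholder base path.
        val sink = new BucketingSink[JTuple2[String, String]]("s3://my-bucket/output")
        sink.setWriter(new AvroKeyValueSinkWriter[String, String](props))

        records.addSink(sink)
        env.execute("avro bucketing sink sketch")
      }
    }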

Re: Implement Joins with Lookup Data

2018-07-24 Thread ashish pok
The app is checkpointing, so it will pick up if an operation fails. I suppose you mean a TM completely crashes; even in that case another TM would spin up and it "should" pick up from the checkpoint. We are running YARN, but I would assume TM recovery would be possible in any other cluster. I haven't

Re: LoggingFactory: Access to static method across operators

2018-07-24 Thread Till Rohrmann
Hi Jayant, I think you should be able to implement your own StaticLoggerBinder which returns your own LoggerFactory. That is quite similar to how the different logging backends (log4j, logback) integrate with slf4j. Cheers, Till On Tue, Jul 24, 2018 at 5:41 PM Jayant Ameta wrote: > I am using
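
A minimal sketch of the approach Till describes, following the slf4j (pre-1.8) convention of binding to whatever class named org.slf4j.impl.StaticLoggerBinder is found on the classpath; MyLoggerFactory and its NOP placeholder logger are hypothetical stand-ins for the custom factory.

    // slf4j (up to 1.7.x) binds to the org.slf4j.impl.StaticLoggerBinder it finds on the
    // classpath, so the class must live in exactly this package.
    package org.slf4j.impl

    import org.slf4j.{ILoggerFactory, Logger}
    import org.slf4j.helpers.NOPLogger
    import org.slf4j.spi.LoggerFactoryBinder

    // Hypothetical custom factory: return whatever Logger implementation you need here.
    class MyLoggerFactory extends ILoggerFactory {
      override def getLogger(name: String): Logger = NOPLogger.NOP_LOGGER // placeholder logger
    }

    class StaticLoggerBinder private () extends LoggerFactoryBinder {
      private val factory = new MyLoggerFactory
      override def getLoggerFactory: ILoggerFactory = factory
      override def getLoggerFactoryClassStr: String = classOf[MyLoggerFactory].getName
    }

    object StaticLoggerBinder {
      private val Singleton = new StaticLoggerBinder
      // slf4j obtains the binder through this static method (via the Scala static forwarder).
      def getSingleton: StaticLoggerBinder = Singleton
    }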

Re: Implement Joins with Lookup Data

2018-07-24 Thread Harshvardhan Agrawal
What happens when one of your workers dies? Say the machine is dead and not recoverable. How do you recover from that situation? Will the pipeline die and you go through the entire bootstrap process again? On Tue, Jul 24, 2018 at 11:56 ashish pok wrote: > BTW, > > We got around bootstrap problem for

Re: Implement Joins with Lookup Data

2018-07-24 Thread ashish pok
BTW, we got around the bootstrap problem for a similar use case by using a "nohup" topic as an input stream. Our CI/CD pipeline currently passes an initialize option to the app IF there is a need to bootstrap, waits for X minutes before taking a savepoint, and then restarts the app normally, listening to the right

LoggingFactory: Access to static method across operators

2018-07-24 Thread Jayant Ameta
I am using a custom LoggingFactory. Is there a way to give all the operators access to this custom LoggingFactory other than adding it to all of their constructors? This is somewhat related to: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Adding-Context-To-Logs-td7351.html

Re: Implement Joins with Lookup Data

2018-07-24 Thread Elias Levy
Alas, this suffers from the bootstrap problem. At the moment Flink does not allow you to pause a source (the positions), so you can't fully consume and preload the accounts or products to perform the join before the positions start flowing. Additionally, Flink SQL does not support

Re: Flink 1.5 batch job fails to start

2018-07-24 Thread vino yang
Hi Alex, Based on your log information, the likely cause is the Hadoop version. To rule out an exception caused by differing Hadoop versions, I suggest you match the Hadoop version on both sides. You can: 1. Upgrade the Hadoop version that the Flink cluster depends on; Flink's official

Re: Flink 1.5 batch job fails to start

2018-07-24 Thread Alex Vinnik
Hi Till, Thanks for responding. Below are the entrypoint logs. One thing I noticed is "Hadoop version: 2.7.3". My fat jar is built with the Hadoop 2.8 client. Could that be the reason for the error? If so, how can I use the same Hadoop version 2.8 on the Flink server side? BTW, the job runs fine locally reading from

Re: Implement Joins with Lookup Data

2018-07-24 Thread Till Rohrmann
Yes, using Kafka could be a solution: you initialize the topic with the initial values and then feed changes into the topic from which you consume. On Tue, Jul 24, 2018 at 3:58 PM Harshvardhan Agrawal < harshvardhan.ag...@gmail.com> wrote: > Hi Till, > > How would we do the initial hydration of

Re: Implement Joins with Lookup Data

2018-07-24 Thread Harshvardhan Agrawal
Hi Till, How would we do the initial hydration of the Product and Account data, since it's currently in a relational DB? Do we have to copy the data over to Kafka and then use it from there? Regards, Harsh On Tue, Jul 24, 2018 at 09:22 Till Rohrmann wrote: > Hi Harshvardhan, > > I agree with Ankit that

[ANNOUNCE] Weekly community update #30

2018-07-24 Thread Till Rohrmann
Dear community, this is the weekly community update thread #30. Please post any news and updates you want to share with the community to this thread. # First RC for Flink 1.6.0 The community has published the first release candidate for Flink 1.6.0 [1]. Please help the community by trying the RC

Re: Implement Joins with Lookup Data

2018-07-24 Thread Till Rohrmann
Hi Harshvardhan, I agree with Ankit that this problem could actually be solved quite elegantly with Flink's state. If you can ingest the product/account information changes as a stream, you can keep the latest version of it in Flink state by using a co-map function [1, 2]. One input of the co-map
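
A minimal sketch of the co-flat-map pattern described above, keeping the latest reference record per key in keyed state; the Product/Position types and field names are hypothetical.

    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector

    // Hypothetical record types for illustration.
    case class Product(productId: String, name: String)
    case class Position(productId: String, quantity: Long)
    case class EnrichedPosition(position: Position, product: Option[Product])

    // Keeps the latest Product per key in keyed state and enriches each Position with it.
    class EnrichmentFunction extends RichCoFlatMapFunction[Product, Position, EnrichedPosition] {

      private var productState: ValueState[Product] = _

      override def open(parameters: Configuration): Unit = {
        productState = getRuntimeContext.getState(
          new ValueStateDescriptor[Product]("latest-product", classOf[Product]))
      }

      // Reference-data input: overwrite the stored value with the newest version.
      override def flatMap1(product: Product, out: Collector[EnrichedPosition]): Unit =
        productState.update(product)

      // Main input: enrich with whatever reference data is currently known for the key.
      override def flatMap2(position: Position, out: Collector[EnrichedPosition]): Unit =
        out.collect(EnrichedPosition(position, Option(productState.value())))
    }

    object EnrichmentJob {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        val products: DataStream[Product] = env.fromElements(Product("p1", "Bond A"))
        val positions: DataStream[Position] = env.fromElements(Position("p1", 100L))

        products.keyBy(_.productId)
          .connect(positions.keyBy(_.productId))
          .flatMap(new EnrichmentFunction)
          .print()

        env.execute("reference-data enrichment sketch")
      }
    }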

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Fabian Hueske
Hi, The problem is that Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. We use the modification time to avoid having to remember the names of all files that were ever consumed, which would be expensive to check and
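
A conceptual sketch (not Flink's actual source) of why this can drop files on an eventually consistent listing such as S3:

    // The monitoring source keeps a single high-water mark, the largest modification time
    // seen so far, instead of a set of file names.
    object ModTimeTracking {
      private var globalModificationTime: Long = Long.MinValue

      // listing: (path, modificationTime) pairs returned by the latest directory scan
      def newFiles(listing: Seq[(String, Long)]): Seq[String] = {
        val fresh = listing.filter { case (_, modTime) => modTime > globalModificationTime }
        if (fresh.nonEmpty) globalModificationTime = fresh.map(_._2).max
        // With an eventually consistent listing, a file whose modification time is below the
        // already-advanced high-water mark can show up late and is then silently skipped.
        fresh.map(_._1)
      }
    }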

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Hello Jörn. Thanks for your help. "Probably the system is putting them in the folder and Flink is triggered before they are consistent." <<< Yes, that is my guess as well. However, if Flink is triggered before they are consistent, then either (a) there should be some error messages, or (b) Flink should be

Re: Memory Logging

2018-07-24 Thread Till Rohrmann
Hi Oliver, which Flink image are you using? If you are using the docker image from docker hub [1], then the memory logging will go to stdout and not to a log file. The reason for this behavior is that the docker image configures the logger to print to stdout such that one can easily access the

Re: SingleOutputStreamOperator vs DataStream?

2018-07-24 Thread Till Rohrmann
Hi Chris, a `DataStream` represents a stream of events which have the same type. A `SingleOutputStreamOperator` is a subclass of `DataStream` and represents a user defined transformation applied to an input `DataStream` and producing an output `DataStream` (represented by itself). Since you can
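
The thread is about the Java API types; the sketch below shows the same mechanism in the Scala API, where the window operator's result exposes getSideOutput for the late-data tag (element types and window size are arbitrary).

    import org.apache.flink.streaming.api.TimeCharacteristic
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
    import org.apache.flink.streaming.api.windowing.time.Time

    object LateDataSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

        // Tag for elements that arrive after the window has fired and allowed lateness has passed.
        val lateTag = OutputTag[(String, Long)]("late-events")

        val events: DataStream[(String, Long)] = env.fromElements(("a", 1L), ("a", 2L))

        // The windowed aggregation result wraps a SingleOutputStreamOperator, which is what
        // makes getSideOutput available for the late-data tag.
        val result = events
          .assignAscendingTimestamps(_._2)
          .keyBy(_._1)
          .window(TumblingEventTimeWindows.of(Time.seconds(10)))
          .sideOutputLateData(lateTag)
          .sum(1)

        val lateEvents: DataStream[(String, Long)] = result.getSideOutput(lateTag)
        lateEvents.print()

        env.execute("late data side output sketch")
      }
    }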

Re: Questions on Unbounded number of keys

2018-07-24 Thread Till Rohrmann
Hi Chang Liu, if you are dealing with an unlimited number of keys and keep state around for every key, then your state size will keep growing with the number of keys. If you are using the FileStateBackend, which keeps state in memory, you will eventually run into an OutOfMemoryException. One way
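
The quoted reply is cut off at "One way"; one common mitigation, shown here purely as an illustration and not necessarily the one the reply goes on to describe, is to move keyed state off the heap with the RocksDB state backend (requires the flink-statebackend-rocksdb dependency; the checkpoint URI below is a placeholder).

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend
    import org.apache.flink.streaming.api.scala._

    object RocksDbBackendSketch {
      def main(args: Array[String]): Unit = {
        val env = StreamExecutionEnvironment.getExecutionEnvironment

        // Keep keyed state in RocksDB on local disk instead of on the JVM heap, so a large or
        // growing key space does not exhaust TaskManager memory; checkpoints go to the given URI.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/flink/checkpoints", true))
        env.enableCheckpointing(60000)

        env.fromElements(("a", 1), ("b", 2), ("a", 3))
          .keyBy(_._1)
          .sum(1)
          .print()

        env.execute("rocksdb state backend sketch")
      }
    }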

Re: streaming predictions

2018-07-24 Thread Andrea Spina
Dear Cederic, I did something similar to what you describe a while ago as part of this work [1], but I've always been working within the batch context. I'm also a co-author of flink-jpmml and, since a flink2pmml model-saver library doesn't currently exist, I'd suggest a twofold strategy to tackle this

Re: NoClassDefFoundError when running Twitter Example

2018-07-24 Thread Till Rohrmann
Hi Syed, could you check whether this class is actually contained in the twitter example jar? If not, then you have to build an uber jar containing all required dependencies. Cheers, Till On Tue, Jul 24, 2018 at 5:11 AM syed wrote: > I am facing the java.lang.NoClassDefFoundError: >

Re: Flink 1.5 batch job fails to start

2018-07-24 Thread Till Rohrmann
Hi Alex, I'm not entirely sure what causes this problem because it is the first time I have seen it. The first question would be whether the problem also arises with a different Hadoop version. Are you using the same Java versions on the client as well as on the server? Could you provide us with the

Re: Implement Joins with Lookup Data

2018-07-24 Thread Harshvardhan Agrawal
Hi, Thanks for your responses. There is no fixed interval for the data being updated. It's more that whenever you onboard a new product, or any mandates change, the reference data changes. It's not just the enrichment we are doing here. Once we have enriched the data

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html You will find there a passage about the consistency model. Probably the system is putting them in the folder and Flink is triggered before they are consistent. What happens after Flink puts them on S3? Are they reused by another

Re: When a jobmanager fails, it doesn't restart because it tries to restart non existing tasks

2018-07-24 Thread Till Rohrmann
Hi Gerard, the first log snippet from the client does not show anything suspicious. The warning just says that you cannot use the Yarn CLI because it lacks the Hadoop dependencies in the classpath. The second snippet is indeed more interesting. If the TaskExecutors are not notified about the

Memory Logging

2018-07-24 Thread Oliver Breit
Hi everyone, We are using a simple Flink setup with one jobmanager and one taskmanager running inside a docker container. We are having issues enabling the taskmanager.debug.memory.startLogThread setting. We added taskmanager.debug.memory.startLogThread: true

Re: streaming predictions

2018-07-24 Thread David Anderson
One option (which I haven't tried myself) would be to somehow get the model into PMML format, and then use https://github.com/FlinkML/flink-jpmml to score the model. You could either use another machine learning framework to train the model (i.e., a framework that directly supports PMML export),

SingleOutputStreamOperator vs DataStream?

2018-07-24 Thread chrisr123
I'm trying to get a list of late elements in my Tumbling Windows application and I noticed that I need to use SingleOutputStreamOperator instead of DataStream to get access to the .sideOutputLateData(...) method. Can someone explain what the difference is between SingleOutputStreamOperator and

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Could you please explain in more detail what you mean by "try read after write consistency (assuming the files are not modified)"? I guess that the problem I got comes from inconsistency in the S3 file listing; otherwise, I would have got file-not-found exceptions. My use case is to read output

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
Sure, Kinesis is another way. Can you try read-after-write consistency (assuming the files are not modified)? In any case, it looks like you would be better served by a NoSQL store or Kinesis (I don't know your exact use case, so I can't provide more details). > On 24. Jul 2018, at 09:51, Averell

Questions on Unbounded number of keys

2018-07-24 Thread Chang Liu
Dear All, I have questions regarding keys. In general, the questions are: what happens if I do keyBy on an unlimited number of keys? How does Flink manage each KeyedStream under the hood? Will I get a memory overflow, for example, if every KeyedStream associated with a specific key is

Re: Implement Joins with Lookup Data

2018-07-24 Thread Jain, Ankit
How often is the product DB updated? Based on that, you can store the product metadata as state in Flink, maybe set up the state on cluster startup and then update it daily, etc. Also, based on this feature alone, Flink doesn't seem to add a lot of value on top of Kafka. As Jörn said below, you can very

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Just an update: I tried enabling the "EMRFS Consistent View" option, but it didn't help. I'm not sure whether that's what you recommended or something else. Thanks!

Re: Question regarding State in full outer join

2018-07-24 Thread Fabian Hueske
Hi Darshan, The join implementation in SQL / Table API does what is demanded by the SQL semantics. Hence, what results to emit and also what data to store (state) to compute these results is pretty much given. You can think of the semantics of the join as writing both streams into a relational

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Hi Jörn, Thanks. I had missed that EMRFS strong-consistency configuration. I will try that now. We also had a backup solution - using Kinesis instead of S3 (I don't see Kinesis in your suggestion, but I hope it would be alright). "The small size and high rate is not suitable for S3 or HDFS"

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
It could be related to S3, which seems to be configured for eventual consistency. Maybe it helps to configure strong consistency. However, I recommend replacing S3 with a NoSQL database (since you are on Amazon, DynamoDB would help, plus DynamoDB Streams; alternatively SNS or SQS). The small size and

Re: Question regarding State in full outer join

2018-07-24 Thread vino yang
Hi Darshan, In your use case, I think you can implement the outer join with the DataStream API (use State + ProcessFunction + Timer). Using suitable state, you can store one value per key and do not need to keep the entire value history for every key. And you can refer to Flink's implementation of
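
A minimal sketch of that State + ProcessFunction + Timer pattern for a keyed outer join; the record types, field names, and wait interval are hypothetical, and the logic is simplified to keep at most one buffered value per key and side (it also assumes event-time timestamps and watermarks are assigned upstream).

    import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
    import org.apache.flink.streaming.api.functions.co.CoProcessFunction
    import org.apache.flink.streaming.api.scala._
    import org.apache.flink.util.Collector

    // Hypothetical record and result types for illustration.
    case class LeftEvent(key: String, value: String, ts: Long)
    case class RightEvent(key: String, value: String, ts: Long)
    case class Joined(key: String, left: Option[String], right: Option[String])

    // Stores at most one value per key and side; a timer flushes unmatched rows as outer results.
    class OuterJoinFunction(waitMillis: Long) extends CoProcessFunction[LeftEvent, RightEvent, Joined] {

      private lazy val leftState: ValueState[LeftEvent] =
        getRuntimeContext.getState(new ValueStateDescriptor[LeftEvent]("left", classOf[LeftEvent]))
      private lazy val rightState: ValueState[RightEvent] =
        getRuntimeContext.getState(new ValueStateDescriptor[RightEvent]("right", classOf[RightEvent]))

      override def processElement1(l: LeftEvent,
                                   ctx: CoProcessFunction[LeftEvent, RightEvent, Joined]#Context,
                                   out: Collector[Joined]): Unit = {
        val r = rightState.value()
        if (r != null) {                       // match found: emit joined result, clean up
          out.collect(Joined(l.key, Some(l.value), Some(r.value)))
          rightState.clear()
        } else {                               // buffer and wait for the other side
          leftState.update(l)
          ctx.timerService().registerEventTimeTimer(l.ts + waitMillis)
        }
      }

      override def processElement2(r: RightEvent,
                                   ctx: CoProcessFunction[LeftEvent, RightEvent, Joined]#Context,
                                   out: Collector[Joined]): Unit = {
        val l = leftState.value()
        if (l != null) {
          out.collect(Joined(r.key, Some(l.value), Some(r.value)))
          leftState.clear()
        } else {
          rightState.update(r)
          ctx.timerService().registerEventTimeTimer(r.ts + waitMillis)
        }
      }

      // When the timer fires, emit whatever is still buffered with the missing side as None.
      override def onTimer(timestamp: Long,
                           ctx: CoProcessFunction[LeftEvent, RightEvent, Joined]#OnTimerContext,
                           out: Collector[Joined]): Unit = {
        Option(leftState.value()).foreach(l => out.collect(Joined(l.key, Some(l.value), None)))
        Option(rightState.value()).foreach(r => out.collect(Joined(r.key, None, Some(r.value))))
        leftState.clear()
        rightState.clear()
      }
    }

    // Usage (both inputs keyed on the join key):
    //   left.keyBy(_.key).connect(right.keyBy(_.key)).process(new OuterJoinFunction(60000L))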