Re: Checkpointing & File stream with

2019-06-17 Thread Yun Tang
Hi Sung, how about using FileProcessingMode.PROCESS_CONTINUOUSLY [1] as the watch type when reading data from HDFS? FileProcessingMode.PROCESS_CONTINUOUSLY periodically monitors the source, while the default FileProcessingMode.PROCESS_ONCE processes the data only once and then exits. [1] https://ci.
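The suggestion above can be sketched with Flink's DataStream API (1.8-era). This is a minimal, hypothetical example, not the poster's actual job; the HDFS path and the 60-second monitoring interval are illustrative.

```java
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class ContinuousFileRead {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String path = "hdfs:///data/input"; // hypothetical path
        // Re-scan the directory every 60s instead of reading once and
        // transitioning the source to FINISHED (which breaks checkpointing).
        DataStream<String> lines = env.readFile(
                new TextInputFormat(new Path(path)),
                path,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60_000L); // monitoring interval in milliseconds
        lines.print();
        env.execute("continuous-file-read");
    }
}
```

With PROCESS_CONTINUOUSLY the file source never reaches the FINISHED state, so checkpointing can proceed for the whole pipeline.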

Checkpointing & File stream with

2019-06-17 Thread Sung Gon Yi
Hello, I am working on joining two streams, one from Kafka and another from a file (small size). Stream processing works well, but checkpointing fails with the following message. The file has fewer than 100 lines, and the part of the pipeline that reads the file reaches the "FINISHED" state as soon as

Re: Need for user class path accessibility on all nodes

2019-06-17 Thread Biao Liu
Ah, sorry for the misunderstanding. So what you are asking is: why do we need "--classpath"? I'm not sure what the original author thought of it. I guess the reasons listed below might have been considered. 1. Avoid duplicated deploying. If some common jars are deployed in advance to each node of the cluster, the jobs dep

Re: Need for user class path accessibility on all nodes

2019-06-17 Thread Abdul Qadeer
Hi Biao, I am aware of it - that's not my question. On Mon, Jun 17, 2019 at 7:42 PM Biao Liu wrote: > Hi Abdul, "--classpath " can be used for those are not included in > user jar. If all your classes are included in your jar passed to Flink, you > don't need this "--classpath". > > Abdul Qadee

Re: Need for user class path accessibility on all nodes

2019-06-17 Thread Biao Liu
Hi Abdul, "--classpath" can be used for jars that are not included in the user jar. If all your classes are included in the jar passed to Flink, you don't need "--classpath". Abdul Qadeer wrote on Tue, Jun 18, 2019 at 3:08 AM: > Hi! > > I was going through submission of a Flink program through CLI. I see that >

Need for user class path accessibility on all nodes

2019-06-17 Thread Abdul Qadeer
Hi! I was going through the submission of a Flink program through the CLI. I see that "--classpath" needs to be accessible from all nodes in the cluster as per the documentation. As I understand it, the jar files are already part of the blob uploaded to the JobManager from the CLI. The TaskManagers can download this
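For reference, the flag under discussion is `-C`/`--classpath` of the `flink run` command. A hypothetical invocation (paths are illustrative) looks like this; the URL must resolve on every node, e.g. a shared mount or an HTTP server, because each TaskManager loads it itself rather than receiving it via the blob upload:

```shell
flink run \
  --classpath file:///mnt/shared/libs/common-utils.jar \
  /path/to/my-job.jar
```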

Re: Calculate a 4 hour time window for sum,count with sum and count output every 5 mins

2019-06-17 Thread Rafi Aroch
Hi Vijay, When using windows, you may use 'trigger' to set a custom Trigger which would fire your *ProcessWindowFunction* accordingly. In your case, you would probably use: > *.trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(5)))* > Thanks, Rafi On Mon, Jun 17, 2019 at 9:01 PM
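Rafi's suggestion can be sketched as follows (a minimal, hypothetical job, not the poster's code; the key function, the `reduce` aggregation, and the bounded stand-in source are illustrative):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.ContinuousProcessingTimeTrigger;

public class EarlyFiringWindow {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Stand-in source; a real job would use an unbounded source such as Kafka.
        DataStream<Long> values = env.fromElements(1L, 2L, 3L);
        values.keyBy(v -> v % 2)
              // One 4-hour tumbling window per key...
              .window(TumblingProcessingTimeWindows.of(Time.hours(4)))
              // ...that emits its current (partial) result every 5 minutes.
              .trigger(ContinuousProcessingTimeTrigger.of(Time.minutes(5)))
              .reduce(Long::sum)
              .print();
        env.execute("early-firing-window");
    }
}
```

The trigger fires repeatedly without purging, so each 5-minute emission carries the running result accumulated so far within the 4-hour window.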

Re: Calculate a 4 hour time window for sum,count with sum and count output every 5 mins

2019-06-17 Thread Vijay Balakrishnan
I am also implementing the ProcessWindowFunction and accessing the windowState to get data, but how do I push data out every 5 mins during a 4 hr time window? I am adding a globalState to handle the 4 hr window. Or should I still use the context.windowState even for the 4 hr window? public c

InvalidProgramException in streaming execution

2019-06-17 Thread Akash Jain
Hi All, I am trying a simple example which looks like this: StreamExecutionEnvironment see = StreamExecutionEnvironment.createLocalEnvironment(); PojoCsvInputFormat pcif = new PojoCsvInputFormat<>(inPath, (PojoTypeInfo) TypeExtractor.createTypeInfo(TestPojo.class)); DataStreamSource stream = see.re

Re: [ANNOUNCEMENT] March 2019 Bay Area Apache Flink Meetup

2019-06-17 Thread Xuefu Zhang
Hi all, The scheduled meetup is only about a week away. Please note that RSVP at meetup.com is required. In order for us to get the actual headcount to prepare for the event, please sign up as soon as possible if you plan to join. Thank you very much for your cooperation. Regards, Xuefu On Thu,

Re: NotSerializableException: DropwizardHistogramWrapper inside AggregateFunction

2019-06-17 Thread Vijay Balakrishnan
Thanks, Fabian. I got around the issue by moving the logic for the DropwizardHistogramWrapper (a non-serializable class) into the ProcessWindowFunction's open() method. On Fri, Jun 7, 2019 at 12:33 AM Fabian Hueske wrote: > Hi, > > There are two ways: > > 1. make the non-serializable member va

Calculate a 4 hour time window for sum,count with sum and count output every 5 mins

2019-06-17 Thread Vijay Balakrishnan
Hi, I need to calculate a 4 hour time window for count and sum, with the current calculated results being output every 5 mins. How do I do that? Currently, I calculate results for 5 sec and 5 min time windows fine on the KeyedStream. Time timeWindow = getTimeWindowFromInterval(interval);//eg: timeWindow =

Re: Apache Flink Sql - How to access EXPR$0, EXPR$1 values from a Row in a table

2019-06-17 Thread Rong Rong
Hi Mans, I am not sure if you intended to name them like this, but if you were to access them you need to escape them like `EXPR$0` [1]. Also, I think Flink defaults unnamed fields in a row to `f0`, `f1`, ... [2], so you might be able to access them like that. -- Rong [1] https://calcite.apache.o
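Rong's two options can be sketched as plain SQL strings (the table name `t` and the queries are hypothetical): either back-tick-escape the generated name, or alias the expression so no generated name exists in the first place.

```java
public class GeneratedColumnNames {
    // Calcite-based dialects (Flink SQL included) name unaliased expressions
    // EXPR$0, EXPR$1, ...; the `$` means the name must be escaped with backticks:
    public static final String BY_ESCAPING =
        "SELECT `EXPR$0` FROM (SELECT COUNT(*) FROM t)";

    // Cleaner: alias the expression so no generated name appears at all:
    public static final String BY_ALIASING =
        "SELECT cnt FROM (SELECT COUNT(*) AS cnt FROM t)";
}
```

Aliasing is usually preferable, since generated names are an implementation detail of the planner.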

Re: How to restart/recover on reboot?

2019-06-17 Thread John Smith
Well, some reasons: machine reboots/maintenance etc., host/VM crashes and restarts. And the same goes for the JobManager. I don't want/need to have to document/remember some start process for sysadmins/devops. So far I have looked at ./start-cluster.sh and all it seems to do is SSH into all the spec

Why the current time of my window never reaches the end time when I use KeyedProcessFunction?

2019-06-17 Thread Felipe Gutierrez
Hi, I used this example of KeyedProcessFunction from the Flink website [1] and I have implemented my own KeyedProcessFunction to process some approximation counting [2]. This worked very well. Then I switched the data source to consume strings from Twitter [3]. The data source is consuming the str
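A common cause of this symptom is that the new source never emits watermarks, so event-time never advances and event-time timers never fire. A minimal sketch of assigning timestamps and watermarks after the source (the `Event` type, its fields, and the 5-second out-of-orderness bound are all illustrative, not the poster's code):

```java
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class AssignWatermarks {
    // Minimal event type standing in for a parsed tweet.
    public static class Event {
        public long timestamp;
        public String body;
        public Event() {}
        public Event(long timestamp, String body) { this.timestamp = timestamp; this.body = body; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        DataStream<Event> source = env.fromElements(new Event(1L, "a"), new Event(2L, "b"));
        // Without this step no watermarks are generated, and event-time timers
        // registered in a downstream KeyedProcessFunction will never fire.
        DataStream<Event> withTimestamps = source.assignTimestampsAndWatermarks(
            new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
                @Override
                public long extractTimestamp(Event e) { return e.timestamp; }
            });
        withTimestamps.print();
        env.execute("assign-watermarks");
    }
}
```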

Re: How to restart/recover on reboot?

2019-06-17 Thread Till Rohrmann
Hi John, I don't have much experience with setting up Flink via systemd services. Why do you want to do it like that? 1. In standalone mode, Flink won't automatically restart TaskManagers. This only works on YARN and Mesos atm. 2. In case of a lost TaskManager, you should run `taskmanager.sh start`.

Re: Best practice to process DB stored log (is Flink the right choice?)

2019-06-17 Thread Piotr Nowojski
Hi, Those are good questions. > A datastream to connect to a table is available? What table, what database system do you mean? You can check the list of existing connectors provided by Flink in the documentation. About reading from a relational DB (for example via JDBC) you can read a little

Has Flink a kafka processing location strategy?

2019-06-17 Thread Theo Diefenthal
Hi, We have a Hadoop/YARN cluster with Kafka and Flink/YARN running on the same machines. In Spark (Streaming), there is a PreferBrokers location strategy, so that the executors consume those Kafka partitions which are served by the Kafka broker on the same machine. ( https://spark.apache.org/

Re: Timeout about local test case

2019-06-17 Thread Piotr Nowojski
Hi, I don’t know what the reason is (there are also no open issues for this test in our JIRA). The test seems to be working on Travis/CI and it works for me when I’ve just tried running it locally. There might be some bug in the test/production code that is triggered only under some specific condition

Re: How to trigger the window function even there's no message input in this window?

2019-06-17 Thread Piotr Nowojski
Hi, As far as I know, this is currently impossible. You can work around this issue by implementing your own custom post-processing operator/flatMap function, that would: - track the output of the window operator - register a processing-time timer with some desired timeout - every time the process
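The workaround Piotr describes can be sketched as a KeyedProcessFunction placed after the window operator. This is a hypothetical implementation under stated assumptions: the class name, the `Long` payload, and the default value `0L` are all illustrative, and the function only fires once per quiet period rather than periodically.

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits a default value for a key if the upstream window operator has been
// silent for `timeoutMs`; otherwise passes window results through unchanged.
public class WindowTimeout extends KeyedProcessFunction<String, Long, Long> {
    private final long timeoutMs;
    private transient ValueState<Long> lastSeen;

    public WindowTimeout(long timeoutMs) { this.timeoutMs = timeoutMs; }

    @Override
    public void open(Configuration conf) {
        lastSeen = getRuntimeContext().getState(
            new ValueStateDescriptor<>("lastSeen", Long.class));
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        out.collect(value); // forward the window operator's output
        long now = ctx.timerService().currentProcessingTime();
        lastSeen.update(now);
        // Arm a timeout; a later element re-arms via its own timer.
        ctx.timerService().registerProcessingTimeTimer(now + timeoutMs);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        Long seen = lastSeen.value();
        // Fire only if no element arrived since this timer was armed;
        // stale timers (element arrived in between) are skipped.
        if (seen == null || timestamp >= seen + timeoutMs) {
            out.collect(0L); // illustrative default output for an empty window
            // re-register another timer here if periodic defaults are needed
        }
    }
}
```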

Re: Flink state: complex value state pojos vs explicitly managed fields

2019-06-17 Thread Congxian Qiu
Hi, If you use RocksDBStateBackend, one state per member will give better performance, because RocksDBStateBackend needs to de/serialize the key/value on every put/get; with a single POJO value, you need to de/serialize the whole POJO value on each put/get. Best, Congxian Timothy Victor wrote on Mon, Jun 17, 2019 at 8:0
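The per-member layout Congxian recommends can be sketched like this (a hypothetical function: the class name is illustrative, and `String` stands in for the `Foo` element type of the original POJO's list):

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Instead of ValueState<MyPojo> (whole POJO de/serialized on every access with
// RocksDB), keep one state primitive per member of the POJO.
public class PerMemberState extends KeyedProcessFunction<String, String, String> {
    private transient ValueState<Long> idState;     // replaces MyPojo.id
    private transient ListState<String> foosState;  // replaces MyPojo.foos

    @Override
    public void open(Configuration conf) {
        idState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("id", Long.class));
        foosState = getRuntimeContext().getListState(
            new ListStateDescriptor<>("foos", String.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        // ListState.add appends one element; no full-list rewrite as with a POJO field.
        foosState.add(value);
        Long id = idState.value();
        idState.update(id == null ? 1L : id + 1);
        out.collect(value);
    }
}
```

With RocksDB, `ListState.add` maps to an append, so growing the list does not pay the cost of rewriting the whole collection the way a `List` field inside a POJO `ValueState` would.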

Re: Flink state: complex value state pojos vs explicitly managed fields

2019-06-17 Thread Timothy Victor
I would choose encapsulation if the fields are indeed related and it makes sense for your model. In general, I feel it is not a good idea to let Flink's (or any other framework's) internal mechanics dictate your data model. Tim On Mon, Jun 17, 2019, 4:59 AM Frank Wilson wrote: > Hi, > > Is it be

Re: Error while using session window

2019-06-17 Thread Piotr Nowojski
Hi, Thanks for reporting the issue. I think this might be caused by System.currentTimeMillis() not being monotonic [1] and the fact that Flink accesses this function multiple times per element (at least twice: first to create a window, second to perform the check that failed in your case)

Unexpected behavior from interval join in Flink

2019-06-17 Thread Wouter Zorgdrager
Hi all, I'm experiencing some unexpected behavior using an interval join in Flink. I'm dealing with two data sets, let's call them X and Y. They are finite (10k elements) but I interpret them as a DataStream. The data needs to be joined for enrichment purposes. I use event time and I know (because
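For context, a minimal interval join over two keyed streams looks like the sketch below (not the poster's code: the `Tuple2<Long, String>` element type, the ±5 ms bounds, and the ascending-timestamp assumption are all illustrative). Elements are joined only when the right element's timestamp lies within `[left + lowerBound, left + upperBound]`, so late or out-of-bound elements are silently dropped, which is a frequent source of "unexpected" results.

```java
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.functions.timestamps.AscendingTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class IntervalJoinExample {
    // Timestamps assumed ascending here; real data would need a bounded-lateness extractor.
    private static AscendingTimestampExtractor<Tuple2<Long, String>> byF0() {
        return new AscendingTimestampExtractor<Tuple2<Long, String>>() {
            @Override
            public long extractAscendingTimestamp(Tuple2<Long, String> t) { return t.f0; }
        };
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        KeySelector<Tuple2<Long, String>, Long> keyById =
            new KeySelector<Tuple2<Long, String>, Long>() {
                @Override
                public Long getKey(Tuple2<Long, String> t) { return t.f0; }
            };

        DataStream<Tuple2<Long, String>> x = env
            .fromElements(Tuple2.of(1L, "x1"), Tuple2.of(2L, "x2"))
            .assignTimestampsAndWatermarks(byF0());
        DataStream<Tuple2<Long, String>> y = env
            .fromElements(Tuple2.of(1L, "y1"), Tuple2.of(2L, "y2"))
            .assignTimestampsAndWatermarks(byF0());

        x.keyBy(keyById)
         .intervalJoin(y.keyBy(keyById))
         .between(Time.milliseconds(-5), Time.milliseconds(5))
         .process(new ProcessJoinFunction<Tuple2<Long, String>, Tuple2<Long, String>, String>() {
             @Override
             public void processElement(Tuple2<Long, String> left, Tuple2<Long, String> right,
                                        Context ctx, Collector<String> out) {
                 out.collect(left.f1 + "|" + right.f1); // enriched record
             }
         })
         .print();
        env.execute("interval-join");
    }
}
```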

What order are events processed in iterative loop?

2019-06-17 Thread John Tipper
For the case of a single iteration of an iterative loop where the feedback type is different from the input stream type, in what order are events processed in the forward flow? For example, if we have: * an input stream containing input1 followed by input2 * a ConnectedIterativeStream at th

Flink state: complex value state pojos vs explicitly managed fields

2019-06-17 Thread Frank Wilson
Hi, Is it better to have one POJO value state with a collection inside, or an explicit state declaration for each member? e.g. MyPojo { long id; List<Foo> foos; // getters / setters omitted } Or two managed state declarations in my process function (a value for the long and a list fo

Re: [ANNOUNCE] Weekly Community Update 2019/24

2019-06-17 Thread Konstantin Knauf
Hi Zili, thank you for adding these threads :) I would have otherwise picked them up next week, just couldn't put everything into one email. Cheers, Konstantin On Sun, Jun 16, 2019 at 11:07 PM Zili Chen wrote: > Hi Konstantin and all, > > Thank Konstantin very much for reviving this tradition

StreamingFileSink with hdfs less than 2.7

2019-06-17 Thread Rinat
Hi mates, I decided to enable persisting the state of our Flink jobs that write data into HDFS, but ran into some trouble with that. I’m trying to use StreamingFileSink with Cloudera Hadoop, which is version 2.6.5 and doesn’t contain the truncate method. So, the job fails immediately when it’s trying to

Re: Loading state from sink/understanding recovery options

2019-06-17 Thread Eduardo Winpenny Tejedor
Hi Congxian, I don't use Flink at the moment; I am trying to evaluate its suitability for my company's purposes by rewriting one of our apps with Flink. We have apps with similar business logic but different code, even though they do essentially the same thing. I am new to the streaming paradigms an