Re: Conceptual question

2018-06-12 Thread David Anderson
…if I misunderstood. Thank you.
> Best Regards,
> Tony Wei
>
>> Thank you all. This discussion is very helpful. It sounds like I can wait for 1.6 though, given our development status.
>>
>> Michael

Re: Conceptual question

2018-06-08 Thread David Anderson
…teV1”.
>>>> I have once implemented something like that here:
>>>> https://github.com/pnowojski/flink/blob/bfc8858fc4b9125b8fc7acd03cb3f95c000926b2/flink-streaming-java/src/main/java/org/apache/flink/str…

Re: streaming predictions

2018-07-24 Thread David Anderson
> Since the predict function only supports DataSet and not DataStream, a Flink contributor on Stack Overflow mentioned that this should be done using a map/flatMap function. Unfortunately I am not able to work this function out.
>
> It would be incredible for me if you could help me a little bit further!
>
> Kind regards and thanks in advance,
> Cederic Bosmans

--
*David Anderson* | Training Coordinator | data Artisans
--
Join Flink Forward - The Apache Flink Conference
Stream Processing | Event Driven | Real Time

Re: test windows

2018-08-31 Thread David Anderson
>>> best,
>>> Zhengwen
>>>
>>> On Tue, Aug 28, 2018 at 7:35 PM Nicos Maris wrote:
>>>> Hi all,
>>>>
>>>> How can I test in Java any streaming job that has a time window?

Re: In-Memory Lookup in Flink Operators

2018-10-01 Thread David Anderson
> -> Reload lookup data source into in-memory table
> -> Continue processing
>
> My concerns around these are:
>
> 1) Possibly storing the same copy of data in every task slot's memory or state backend (RocksDB in my case).
> 2) Having a dedicated refresh thread for each subtask instance (possibly, every Task Manager having multiple refresh threads).
>
> Am I thinking in the right direction? Or missing something very obvious? It's confusing.
>
> Any leads are much appreciated. Thanks in advance.
>
> Cheers,
> Chirag

Re: GlobalWindow with custom tigger doesn't work correctly

2019-01-23 Thread David Anderson
> @Override
> public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
>     return TriggerResult.CONTINUE;
> }
>
> @Override
> public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
>     return TriggerResult.CONTINUE;
> }
>
> @Override
> public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
>     ctx.getPartitionedState(prevLunum).clear();
> }
> }
>
> I'm very grateful for your help.
>
> Regards,
> Daniel

Re: using updating shared data

2019-01-04 Thread David Anderson
Another solution to the watermarking issue is to write an AssignerWithPeriodicWatermarks for the control stream that always returns Watermark.MAX_WATERMARK as the current watermark. This produces watermarks for the control stream that will effectively be ignored. On Thu, Jan 3, 2019 at 9:18 PM
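A minimal sketch of such an assigner, assuming a hypothetical `ControlEvent` type for the control stream (the pre-1.12 `AssignerWithPeriodicWatermarks` interface is what the thread refers to):

```java
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Always reports MAX_WATERMARK, so when this stream is connected to the
// data stream, the combined watermark (the min of the two) is driven
// entirely by the data stream -- the control stream never holds it back.
public class ControlStreamWatermarks
        implements AssignerWithPeriodicWatermarks<ControlEvent> {

    @Override
    public Watermark getCurrentWatermark() {
        return Watermark.MAX_WATERMARK;
    }

    @Override
    public long extractTimestamp(ControlEvent element, long previousTimestamp) {
        // timestamps on the control stream are irrelevant here
        return Long.MIN_VALUE;
    }
}
```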

Re: How to clear keyed states periodically?

2018-09-14 Thread David Anderson
> …call state.clear(), but in this way I can only clear the state for one key, and actually I need a key-group-level cleanup. So I'm wondering, is there any best practice for my case? Thanks a lot!
>
> Best,
> Paul Lam

Re: How to clear keyed states periodically?

2018-09-14 Thread David Anderson
> …it will be my last resort.
>
> I wonder if it's possible to implement a clearAll() method for keyed states which clears user states for all namespaces, and does it violate the principle of keyed states?
>
> Thanks again!
>
> Best,
> Paul Lam
>
> On Sep 14, 2018, at 16:…

Re: Watermarks in Event Time windowing

2018-09-14 Thread David Anderson
>> Hi All,
>>
>> Can someone show a good example of how watermarks need to be generated when using EventTime windows? What happens if the watermark is the same as the timestamp? How does the watermark help in the window to be triggered, and what if watermarks a…

Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread David Anderson
…/flink-docs-stable/ops/config.html#metrics

On Sat, Mar 23, 2019 at 1:03 PM Padarn Wilson wrote:
> Hi User,
>
> I am runnin…

Re: Flink metrics missing from UI 1.7.2

2019-03-23 Thread David Anderson
> …t seem to be exported to datadog (should it be?)
>
> On Sat, Mar 23, 2019 at 8:43 PM David Anderson wrote:
>> Because latency tracking is expensive, it is turned off by default. You turn it on by setting the interval; that looks something like this:
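For reference, the interval being discussed can be set programmatically via the ExecutionConfig; a minimal sketch (the 30-second value is an arbitrary example):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// emit latency markers every 30 seconds (the value is in milliseconds);
// latency tracking is off by default because it is expensive
env.getConfig().setLatencyTrackingInterval(30_000L);
```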

Re: Ask about running Flink sql-client.sh

2019-05-02 Thread David Anderson
There are some step-by-step instructions for setting up the sql client in https://training.ververica.com/setup/sqlClient.html, plus some examples.

Re: Ask about Running Flink Jobs From Eclipse

2019-05-02 Thread David Anderson
When you run a Flink job from within an IDE, you end up running with a LocalStreamEnvironment (rather than a remote cluster) that by default does not provide the Web UI. If you want the Flink running in the IDE to have its own dashboard, you can do this by adding this to your application:
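Concretely, that looks like the following (note that this assumes the `flink-runtime-web` dependency is on the classpath, since the local environment needs it to serve the UI):

```java
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Creates a local environment that also starts the Flink web dashboard,
// by default on http://localhost:8081
Configuration conf = new Configuration();
StreamExecutionEnvironment env =
        StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);
```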

Re: Can I use watermarkers to have a global trigger of different ProcessFunction's?

2019-08-21 Thread David Anderson
What Watermarks do is to advance the event time clock. You can consider a Watermark(t) as an assertion about the completeness of the stream -- it marks a point in the stream and says that at that point, the stream is (probably) now complete up to time t. The autoWatermarkInterval determines how

Re: Window Function that releases when downstream work is completed

2019-08-21 Thread David Anderson
I'm not sure I fully understand the scenario you envision. Are you saying you want to have some sort of window that batches (and deduplicates) up until a downstream map has finished processing the previous deduplicated batch, and then the window should emit the new batch? If that's what you want,

Re: Multiple trigger events on keyed window

2019-08-22 Thread David Anderson
The role of the watermarks in your job will be to trigger the closing of the sliding event time windows. In order to play that role properly, they should be based on the timestamps in the events, rather than some arbitrary constant (L). The reason why the same object is responsible for

Re: Multiple trigger events on keyed window

2019-08-22 Thread David Anderson
If you still need help diagnosing the cause of the misbehavior, please share more of the code with us.

On Wed, Aug 21, 2019 at 6:24 PM Eric Isling wrote:
> I should add that the behaviour persists, even when I force parallelism to 1.

Re: Can I use watermarkers to have a global trigger of different ProcessFunction's?

2019-08-23 Thread David Anderson
> …e.o.gutierrez -- https://felipeogutierrez.blogspot.com
>
> On Wed, Aug 21, 2019 at 9:09 PM David Anderson wrote:
>> What Watermarks do is to advance the event time clock. You can consider a Watermark(t) as an assertion about the completeness of the stream…

Re: Running flink on AWS ECS

2019-09-28 Thread David Anderson
I believe there can be advantages and disadvantages in both directions. For example, fewer containers with multiple slots reduce the effort the Flink Master has to expend whenever global coordination is required, e.g., during checkpointing. And the network stack in the task managers is optimized to…

Re: Re ordering events with flink

2019-11-02 Thread David Anderson
Yes, it's possible to sort a stream by the event time timestamps on the events. This depends on having reliable watermarks -- as events that are late can not be emitted in order. There are several ways to accomplish this. You could, for example, use a ProcessFunction, and implement the sorting
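A sketch of that sorting approach, assuming a hypothetical `Event` type with a public `long timestamp` field (this mirrors the pattern used in the Flink training materials):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Buffers events keyed by timestamp, and emits them in timestamp order
// once the watermark indicates no earlier events can still arrive.
public class SortFunction extends KeyedProcessFunction<String, Event, Event> {

    private MapState<Long, List<Event>> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("buffer",
                        Types.LONG, Types.LIST(Types.POJO(Event.class))));
    }

    @Override
    public void processElement(Event e, Context ctx, Collector<Event> out) throws Exception {
        if (e.timestamp > ctx.timerService().currentWatermark()) {
            List<Event> events = buffer.get(e.timestamp);
            if (events == null) {
                events = new ArrayList<>();
            }
            events.add(e);
            buffer.put(e.timestamp, events);
            // fire once the watermark reaches this timestamp
            ctx.timerService().registerEventTimeTimer(e.timestamp);
        }
        // else: the event is late and cannot be emitted in order;
        // drop it, or route it to a side output
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<Event> out) throws Exception {
        for (Event e : buffer.get(ts)) {
            out.collect(e);
        }
        buffer.remove(ts);
    }
}
```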

Re: Order events by filed that does not represent time

2019-12-11 Thread David Anderson
Krzysztof, Note that if you want to have Flink treat these sequence numbers as event time timestamps, you probably can, so long as they are generally increasing, and there's some bound on how out-of-order they can be. The advantage to doing this is that you might be able to use Flink SQL's event

Re: Apache Flink - Clarifications about late side output

2019-12-11 Thread David Anderson
I'll attempt to answer your questions.

> If we have allowed lateness greater than 0 (say 5), then if an event arrives at window end + 3 (within allowed lateness),
> (a) is it considered late and included in the window function as a late firing?

An event with a timestamp that…

Re: Apache Flink - Clarifications about late side output

2019-12-11 Thread David Anderson
…code.

On Wed, Dec 11, 2019 at 2:08 PM David Anderson wrote:
> I'll attempt to answer your questions.
>
>> If we have allowed lateness greater than 0 (say 5), then if an event which arrives at window end + 3 (within allowed lateness),
>> (a) is it considered…

Re: Watermark won't advance in ProcessFunction

2019-10-28 Thread David Anderson
The reason why the watermark is not advancing is that assignAscendingTimestamps is a periodic watermark generator. This style of watermark generator is called at regular intervals to create watermarks -- by default, this is done every 200 msec. With only a tiny bit of data to process, the job
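The 200 ms default mentioned above can be changed on the ExecutionConfig; a minimal sketch:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// periodic watermark generators are invoked every 200 ms by default;
// lower this for jobs with very little data, so watermarks advance sooner
env.getConfig().setAutoWatermarkInterval(50L); // milliseconds
```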

Re: How to convert retract stream to dynamic table?

2019-12-19 Thread David Anderson
The Elasticsearch, HBase, and JDBC[1] table sinks all support streaming UPSERT mode[2]. While not exactly the same as RETRACT mode, it seems like this might get the job done (unless I'm missing something, which is entirely possible). David [1]

Re: ValueState with pure Java class keeping lists/map vs ListState/MapState, which one is a recommended way?

2020-01-18 Thread David Anderson
[Note that this question is better suited for the user mailing list than dev.] In general, using ListState and MapState is recommended rather than using ValueState<List<T>> or ValueState<Map<K, V>>. Some of the state backends are able to optimize common access patterns for ListState and MapState in ways that are…

Re: Influxdb reporter not honouring the metrics scope

2020-01-22 Thread David Anderson
Gaurav, I haven't used it for a couple of years, so I don't know if it still works, but https://github.com/jgrier/flink-stuff/tree/master/flink-influx-reporter is an influxdb reporter (wrapped around https://github.com/davidB/metrics-influxdb/tree/master/src/main/java/metrics_influxdb) that uses

Re: How to get kafka record's timestamp in job

2019-12-31 Thread David Anderson
> In kafka010, ConsumerRecord has a field named timestamp. It is encapsulated in Kafka010Fetcher.
> How can I get the timestamp when I write a flink job?

Kafka010Fetcher puts the timestamps into the StreamRecords that wrap your events. If you want to access these timestamps, you can use a…
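One way to read those timestamps is via `ctx.timestamp()` in a ProcessFunction; a sketch, assuming `stream` is a `DataStream<String>` coming from the Kafka consumer:

```java
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// ctx.timestamp() returns the timestamp carried in the StreamRecord,
// which for a Kafka source is the record's Kafka timestamp
stream.process(new ProcessFunction<String, String>() {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        Long kafkaTimestamp = ctx.timestamp(); // may be null if no timestamp was set
        out.collect(kafkaTimestamp + ": " + value);
    }
});
```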

Re: Do I need to set assignTimestampsAndWatermarks if I set my time characteristic to IngestionTime?

2020-03-10 Thread David Anderson
Watermarks are a tool for handling out-of-orderness when working with event time timestamps. They provide a mechanism for managing the tradeoff between latency and completeness, allowing you to manage how long to wait for any out-of-orderness to resolve itself. Note the way that Flink uses these

Re: RocksDB

2020-03-10 Thread David Anderson
The State Processor API goes a bit in the direction you asking about, by making it possible to query savepoints. https://ci.apache.org/projects/flink/flink-docs-stable/dev/libs/state_processor_api.html

[PROPOSAL] Contribute training materials to Apache Flink

2020-04-09 Thread David Anderson
Dear Flink Community! For some time now Ververica has been hosting some freely available Apache Flink training materials at https://training.ververica.com. This includes tutorial content covering the core concepts of the DataStream API, and hands-on exercises that accompany those explanations.

Re: Is it possible to emulate keyed state with operator state?

2020-04-10 Thread David Anderson
Hypothetically, yes, I think this is possible to some extent. You would have to give up all the things that require a KeyedStream, such as timers, and the RocksDB state backend. And performance would suffer. As for the question of determining which key groups (and ultimately, which keys) are

Re: Upgrading Flink

2020-04-14 Thread David Anderson
@Chesnay Flink doesn't seem to guarantee client-jobmanager compability, even for bug-fix releases. For example, some jobs compiled with 1.9.0 don't work with a cluster running 1.9.2. See https://github.com/ververica/sql-training/issues/8#issuecomment-590966210 for an example of a case when

Re: [PROPOSAL] Contribute training materials to Apache Flink

2020-04-15 Thread David Anderson
> …the community should be aware of is that we also need to maintain the training material. Maybe you could share with us how much effort this usually is when updating the training material for a new Flink version, David?
>
> Cheers,
> Till
>
> On Fri, Apr 10, 2020 a…

Re: [PROPOSAL] Contribute training materials to Apache Flink

2020-04-10 Thread David Anderson
>> …org and flink-playgrounds, respectively, are the best places to host this content.
>>
>> On Thu, Apr 9, 2020 at 2:56 PM Niels Basjes wrote:
>>> Hi,
>>>
>>> Sounds like a very nice thing to have as part of the project ecosystem.

Re: Sync two DataStreams

2020-04-04 Thread David Anderson
There are a few ways to pre-ingest data from a side input before beginning to process another stream. One is to use the State Processor API [1] to create a savepoint that has the data from that side input in its state. For a simple example of bootstrapping state into a savepoint, see [2]. Another

Re: Ask for reason for choice of S3 plugins

2020-03-27 Thread David Anderson
…parent directory - all these existence requests are S3 HEAD requests, which have very low request rate limits

On Fri, Mar 27, 2020 at 7…

Re: How to debug checkpoints failing to complete

2020-03-25 Thread David Anderson
seeksst has already covered many of the relevant points, but a few more thoughts: I would start by checking to see if the checkpoints are failing because they timeout, or for some other reason. Assuming they are timing out, then a good place to start is to look and see if this can be explained by

Re: "Fill in" notification messages based on event time watermark

2020-04-27 Thread David Anderson
Following up on Piotr's outline, there's an example in the documentation of how to use a KeyedProcessFunction to implement an event-time tumbling window [1]. Perhaps that can help you get started. Regards, David [1]

Re: Autoscaling flink application

2020-05-05 Thread David Anderson
There's no explicit, out-of-the-box support for autoscaling, but it can be done. Autoscaling came up a few times at the recent Virtual Flink Forward, including a talk on Autoscaling Flink at Netflix [1] by Timothy Farkas. [1] https://www.youtube.com/watch?v=NV0jvA5ZDNc Regards, David On Mon,

Re: checkpointing opening too many file

2020-05-07 Thread David Anderson
With the FsStateBackend you could also try increasing the value of state.backend.fs.memory-threshold [1]. Only those state chunks that are larger than this value are stored in separate files; smaller chunks go into the checkpoint metadata file. The default is 1KB, increasing this should reduce
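As a sketch, in flink-conf.yaml (the 100kb value is an arbitrary example; note the documented maximum for this key is 1mb):

```yaml
# Raise the threshold so that small state chunks are inlined in the
# checkpoint metadata file instead of being written as separate files.
# The default is 1kb.
state.backend.fs.memory-threshold: 100kb
```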

Re: FlinkCEP - Detect absence of a certain event

2020-03-19 Thread David Anderson
Humberto, Although Flink CEP lacks notFollowedBy at the end of a Pattern, there is a way to implement this by exploiting the timeout feature. The Flink training includes an exercise [1] where the objective is to identify taxi rides with a START event that is not followed by an END event within
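A rough sketch of the timeout technique, assuming a hypothetical `TaxiRide` type with `isStart` and `rideId` fields (modeled loosely on the training exercise; the timed-out partial matches are precisely the rides whose END never arrived in time):

```java
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternFlatSelectFunction;
import org.apache.flink.cep.PatternFlatTimeoutFunction;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// START followed by END within 2 hours; rides that time out are the
// "absent event" cases we want to detect.
Pattern<TaxiRide, TaxiRide> completedRides = Pattern.<TaxiRide>begin("start")
        .where(new SimpleCondition<TaxiRide>() {
            @Override
            public boolean filter(TaxiRide ride) { return ride.isStart; }
        })
        .next("end")
        .where(new SimpleCondition<TaxiRide>() {
            @Override
            public boolean filter(TaxiRide ride) { return !ride.isStart; }
        })
        .within(Time.hours(2));

PatternStream<TaxiRide> patternStream =
        CEP.pattern(rides.keyBy(r -> r.rideId), completedRides);

OutputTag<TaxiRide> timedOut = new OutputTag<TaxiRide>("timedOut") {};

SingleOutputStreamOperator<TaxiRide> matched = patternStream.flatSelect(
        timedOut,
        (PatternFlatTimeoutFunction<TaxiRide, TaxiRide>) (pattern, ts, out) ->
                out.collect(pattern.get("start").get(0)),   // no END in time
        (PatternFlatSelectFunction<TaxiRide, TaxiRide>) (pattern, out) ->
                out.collect(pattern.get("end").get(0)));    // completed ride

DataStream<TaxiRide> ridesWithNoEnd = matched.getSideOutput(timedOut);
```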

Re: Can't create a savepoint with State Processor API

2020-03-19 Thread David Anderson
…/alpinegizmo/ff3d2e748287853c88f21259830b29cf for my version.

On Thu, Mar 19, 2020 at 2:54 AM Dmitry Minaev wrote:
> Hi every…

Re: what is the difference between map vs process on a datastream?

2020-03-17 Thread David Anderson
Map applies a MapFunction (or a RichMapFunction) to a DataStream and does a one-to-one transformation of the stream elements. Process applies a ProcessFunction, which can produce zero, one, or many events in response to each event. And when used on a keyed stream, a KeyedProcessFunction can use
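The contrast in a minimal sketch, assuming `words` is a `DataStream<String>`:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;

// map: strictly one output element per input element
DataStream<Integer> lengths = words.map(w -> w.length());

// process: zero, one, or many outputs per input, via the Collector
DataStream<String> nonEmpty = words.process(new ProcessFunction<String, String>() {
    @Override
    public void processElement(String w, Context ctx, Collector<String> out) {
        if (!w.isEmpty()) {
            out.collect(w); // could also collect several times, or not at all
        }
    }
});
```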

Re: Flink not outputting windows before all data is seen

2020-09-01 Thread David Anderson
> …had set the parallelism to 1 job-wide, but when I reran the task, I added your line. Are there any circumstances where, despite having the global level set to 1, you still need to set the level on individual operators?
>
> PS: I sent this to you directly; I'm so…

Re: How to get Latency Tracking results?

2020-09-09 Thread David Anderson
Pankaj, The Flink web UI doesn't do any visualizations of histogram metrics, so the only way to access the latency metrics is either through the REST api or a metrics reporter. The REST endpoint you tried is the correct place to find these metrics in all recent versions of Flink, but somewhere

Re: How to get Latency Tracking results?

2020-09-09 Thread David Anderson
> …commit ID: 7eb514a.
> Is it possible that the default SocketWindowWordCount job is too simple to generate latency metrics? Or that the latency metrics disappear from the output JSON when the data ingestion is zero?
>
> Thanks,
> Pankaj
>
> On Wed, Sep 9, 2020 at 6:27 AM David An…

Re: Slow Performance inquiry

2020-09-09 Thread David Anderson
Heidy, which state backend are you using? With RocksDB Flink will have to do ser/de on every access and update, but with the FsStateBackend, your sparse matrix will sit in memory, and only have to be serialized during checkpointing. David On Wed, Sep 9, 2020 at 2:41 PM Heidi Hazem Mohamed

Re: Watermark generation issues with File sources in Flink 1.11.1

2020-09-09 Thread David Anderson
Arti, The problem with watermarks and the File source operator will be fixed in 1.11.2 [1]. This bug was introduced in 1.10.0, and isn't related to the new WatermarkStrategy api. [1] https://issues.apache.org/jira/browse/FLINK-19109 David On Wed, Sep 9, 2020 at 2:52 PM Arti Pande wrote: > Hi

Re: Connecting two streams and order of their processing

2020-09-16 Thread David Anderson
The details of what can go wrong will vary depending on the precise scenario, but no, Flink is unable to provide any such guarantee. Doing so would require being able to control the scheduling of various threads running on different machines, which isn't possible. Of course, if event A becomes

Re: How to get Latency Tracking results?

2020-09-10 Thread David Anderson
>> …s, which is not happening in my case.
>>
>> [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/monitoring/metrics.html#latency-tracking
>>
>> On Wed, Sep 9, 2020 at 10:08 AM David Anderson wrote:
>>> Pankaj,

Re: S3 StreamingFileSink issues

2020-10-07 Thread David Anderson
Dan, The first point you've raised is a known issue: When a job is stopped, the unfinished part files are not transitioned to the finished state. This is mentioned in the docs as Important Note 2 [1], and fixing this is waiting on FLIP-46 [2]. That section of the docs also includes some

Re: S3 StreamingFileSink issues

2020-10-07 Thread David Anderson
> …ould happen, happy to submit a patch.
>
> On Wed, Oct 7, 2020 at 1:37 AM David Anderson wrote:
>> Dan,
>>
>> The first point you've raised is a known issue: When a job is stopped, the unfinished part files are not transitioned to the f…

Re: how to simulate the scene "back pressure" in flink?Thanks~!

2020-10-09 Thread David Anderson
>> Could you tell me where to set RestOptions.PORT in the configuration? What value should I set for RestOptions.PORT?
>>
>> Thanks.
>>
>> -- Original message --
>> From: "Arvid Heise"
>> Sent: Friday, October 9, 2020, 3:00 PM

Re: what's the datasets used in flink sql document?

2020-10-15 Thread David Anderson
For a dockerized playground that includes a dataset, many working examples, and training slides, see [1]. [1] https://github.com/ververica/sql-training David On Thu, Oct 15, 2020 at 10:18 AM Piotr Nowojski wrote: > Hi, > > The examples in the documentation are not fully functional. They

Re: need help about "incremental checkpoint",Thanks

2020-10-02 Thread David Anderson
It looks like you were trying to resume from a checkpoint taken with the FsStateBackend into a revised version of the job that uses the RocksDbStateBackend. Switching state backends in this way is not supported: checkpoints and savepoints are written in a state-backend-specific format, and can

Re: need help about "incremental checkpoint",Thanks

2020-10-02 Thread David Anderson
> …kEnd)?
>
> Many thanks!
>
> -- Original message --
> From: 大森林
> Sent: Friday, October 2, 2020, 11:41 PM
> To: David Anderson
> Cc: user
> Subject: Re: need help about "incremental checkpoint", Thanks

Re: need help about "incremental checkpoint",Thanks

2020-10-02 Thread David Anderson
> …eBackend. *It's NOT a match.*
> So I'm wrong in step 5?
> Is my above understanding right?
>
> Thanks for your help.
>
> -- Original message --
> From: David Anderson
> Sent: Friday, October 2, 2020, 10:35 PM
> To: 大森林
> Cc: …

Re: need help about "incremental checkpoint",Thanks

2020-10-06 Thread David Anderson
….

Best,
David

On Mon, Oct 5, 2020 at 2:38 PM 大森林 wrote:
> Could you give more details?
> Thanks
>
> -- Original message --
> From: 大森林
> Sent: Saturday, October 3, 2020, 9:30 AM
> To: David Anderson
> Cc: user

Re: [DISCUSS] Remove flink-connector-filesystem module.

2020-10-13 Thread David Anderson
I think the pertinent question is whether there are interesting cases where the BucketingSink is still a better choice. One case I'm not sure about is the situation described in docs for the StreamingFileSink under Important Note 2 [1]: ... upon normal termination of a job, the last

Re: how to simulate the scene "back pressure" in flink?Thanks~!

2020-10-10 Thread David Anderson
> …erstand both:
>
> Dear David Anderson:
> Is the whole command like this?
> flink run *--backpressure* -c wordcount_increstate d…

Re: how to simulate the scene "back pressure" in flink?Thanks~!

2020-10-10 Thread David Anderson
…t 5:32 PM 大森林 wrote:
> Could I use your command with no docker?
>
> -- Original message --
> From: David Anderson
> Sent: Saturday, October 10, 2020, 10:30 PM
> To: 大森林
> Cc: Arvid Heise; user

Re: Ververica Flink training resources

2020-08-23 Thread David Anderson
Piper, 1. Thanks for reporting the problem with the broken links. I've just fixed this. 2. The exercises were recently rewritten so that they no longer use the old file-based datasets. Now they use data generators that are included in the project. As part of this update, the schema was modified

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-18 Thread David Anderson
> …or be ignored (but not throw an exception). I could also see ignoring the timers in batch as the default, if this makes more sense.
>
> By the way, do you have any use cases in mind that will help us better shape our processing-time timer handling?
>
> Kostas

Re: Failures due to inevitable high backpressure

2020-08-26 Thread David Anderson
You should begin by trying to identify the cause of the backpressure, because the appropriate fix depends on the details. Possible causes that I have seen include:

- the job is inadequately provisioned
- blocking i/o is being done in a user function
- a huge number of timers are firing…

Re: Flink not outputting windows before all data is seen

2020-08-29 Thread David Anderson
Teodor, This is happening because of the way that readTextFile works when it is executing in parallel, which is to divide the input file into a bunch of splits, which are consumed in parallel. This is making it so that the watermark isn't able to move forward until much or perhaps all of the file

Re: How to trigger only once for each window when using TumblingProcessingTimeWindows?

2020-08-19 Thread David Anderson
The purpose of the reduce() and aggregate() methods on windows is to allow for incremental computation of window results. This has two principal advantages: (1) the computation of the results is spread out, rather than occurring all in one go at the end of each window, thereby reducing the
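A minimal sketch of incremental aggregation, assuming a hypothetical `Event` type with `key` and `value` fields; only the running sum is kept in window state, rather than every event:

```java
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// reduce() is applied as each event arrives, so the window's state is
// just one accumulated Event per key, and results are ready at window end
stream
    .keyBy(e -> e.key)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .reduce((a, b) -> new Event(a.key, a.value + b.value));
```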

Re: Failures due to inevitable high backpressure

2020-08-26 Thread David Anderson
On Wed, Aug 26, 2020 at 7:41 PM David Anderson wrote:
> You should begin by trying to identify the cause of the backpressure, because the appropriate fix depends on the details.
>
> Possible causes that I have seen include:
>
> - the job is inadequately provisioned
> - block…

Re: Ververica Flink training resources

2020-08-24 Thread David Anderson
> …in the new repo.
>
> Best,
> Piper
>
> On Sun, Aug 23, 2020 at 10:15 AM David Anderson wrote:
>> Piper,
>>
>> 1. Thanks for reporting the problem with the broken links. I've just fixed this.
>>
>> 2. The exercises were recently r…

Re: [DISCUSS] Removing deprecated methods from DataStream API

2020-08-17 Thread David Anderson
I assume that along with DataStream#fold you would also remove WindowedStream#fold. I'm in favor of going ahead with all of these. David On Mon, Aug 17, 2020 at 10:53 AM Dawid Wysakowicz wrote: > Hi devs and users, > > I wanted to ask you what do you think about removing some of the >

Re: [DISCUSS] FLIP-134: DataStream Semantics for Bounded Input

2020-08-17 Thread David Anderson
Kostas, I'm pleased to see some concrete details in this FLIP. I wonder if the current proposal goes far enough in the direction of recognizing the need some users may have for "batch" and "bounded streaming" to be treated differently. If I've understood it correctly, the section on scheduling

Re: Efficiently processing sparse events in a time windows

2020-09-24 Thread David Anderson
Steven, I'm pretty sure this is a scenario that doesn't have an obvious good solution. As you have discovered, the window API isn't much help; using a process function does make sense. The challenge is finding a data structure to use in keyed state that can be efficiently accessed and updated.

Re: Experiencing different watermarking behaviour in local/standalone modes when reading from kafka topic with idle partitions

2020-10-01 Thread David Anderson
If you were to use per-partition watermarking, which you can do by calling assignTimestampsAndWatermarks directly on the Flink Kafka consumer [1], then I believe the idle partition(s) would consistently hold back the overall watermark. With per-partition watermarking, each Kafka source task will
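A sketch of attaching the watermark strategy to the consumer itself (Flink 1.11 `WatermarkStrategy` API, assuming a hypothetical `Event` type with a `timestamp` field and an existing `deserializer` and `props`):

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

FlinkKafkaConsumer<Event> consumer =
        new FlinkKafkaConsumer<>("topic", deserializer, props);

// Assigning the strategy on the consumer (not downstream) gives
// per-partition watermarking inside each Kafka source task.
consumer.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                .withTimestampAssigner((event, ts) -> event.timestamp)
                // optionally: mark partitions idle so they stop
                // holding back the overall watermark
                .withIdleness(Duration.ofMinutes(1)));

DataStream<Event> events = env.addSource(consumer);
```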

Re: BufferTimeout throughput vs latency.

2020-09-16 Thread David Anderson
The performance loss being referred to there is reduced throughput. There's a blog post by Nico Kruber [1] that covers Flink's network stack in considerable detail. The last section on latency vs throughput gives some more insight on this point. In the experiment reported on there, the difference
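The knob being discussed is the buffer timeout on the execution environment; a minimal sketch:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// default is 100 ms; 0 flushes after every record (lowest latency, worst
// throughput); -1 flushes only when network buffers are full
env.setBufferTimeout(10);
```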

Re: Automatically restore from checkpoint

2020-09-18 Thread David Anderson
If your job crashes, Flink will automatically restart from the latest checkpoint, without any manual intervention. JobManager HA is only needed for automatic recovery after the failure of the Job Manager. You only need externalized checkpoints and "-s :checkpointPath" if you want to use

Re: How do I trigger clear custom state in ProcessWindowsFunction

2020-07-18 Thread David Anderson
ProcessWindowFunction#process is passed a Context object that contains

    public abstract KeyedStateStore windowState();
    public abstract KeyedStateStore globalState();

which are available for you to use for custom state. Whatever you store in windowState is scoped to a window, and is cleared…
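A sketch of using globalState inside process(), assuming a hypothetical `Event` input type; here a per-key firing count survives across windows:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.util.Collector;

// inside a ProcessWindowFunction<Event, Long, String, TimeWindow>:
@Override
public void process(String key, Context ctx, Iterable<Event> events,
                    Collector<Long> out) throws Exception {
    // globalState: keyed state shared by all windows for this key;
    // windowState() would instead be cleared along with each window
    ValueState<Long> firings = ctx.globalState().getState(
            new ValueStateDescriptor<>("firings", Long.class));
    Long count = firings.value();
    firings.update(count == null ? 1L : count + 1);
    out.collect(firings.value());
}
```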

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-03 Thread David Anderson
Jincheng, One thing that I like about the way that the documentation is currently organized is that it's relatively straightforward to compare the Python API with the Java and Scala versions. I'm concerned that if the PyFlink docs are more independent, it will be challenging to respond to

Re: Submit Flink 1.11 job from java

2020-08-07 Thread David Anderson
Flavio, Have you looked at application mode [1] [2] [3], added in 1.11? It offers at least some of what you are looking for -- the application jar and its dependencies can be pre-uploaded to HDFS, and the main() method runs on the job manager, so none of the classes have to be loaded in the

Re: [DISCUSS] FLIP-133: Rework PyFlink Documentation

2020-08-04 Thread David Anderson
> …ent structure.
>
>> it's relatively straightforward to compare the Python API with the Java and Scala versions.
>
> Regarding the comparison between the Python API and the Java/Scala API, I think the majority of users, especially beginner users, would not have this deman…

Re: Support for Event time clock specific to each stream in parallel streams

2020-07-31 Thread David Anderson
It sounds like you would like to have something like event-time-based windowing, but with independent watermarking for every key. An approach that can work, but it is somewhat cumbersome, is to not use watermarks or windows, but instead put all of the logic in a KeyedProcessFunction (or

Re: [Flink Unit Tests] Unit test for Flink streaming codes

2020-08-02 Thread David Anderson
Vijay, There's a section of the docs that describes some strategies for writing tests of various types, and it includes some Scala examples [1]. There are also some nice examples from Konstantin Knauf in [2], though they are mostly in Java. [1]

Re: Question about Watermarks within a KeyedProcessFunction

2020-06-27 Thread David Anderson
With an AscendingTimestampExtractor, watermarks are not created for every event, and as your job starts up, some events will be processed before the first watermark is generated. The impossible value you see is an initial value that's in place until the first real watermark is available. On the

Re: Restore from savepoint through Java API

2020-06-12 Thread David Anderson
You can study LocalStreamingFileSinkTest [1] for an example of how to approach this. You can use the test harnesses [2], keeping in mind that:

- initializeState is called during instance creation
- the provided context indicates if state is being restored from a snapshot
- snapshot is called when…

Re: Is there a way to use stream API with this program?

2020-07-25 Thread David Anderson
In this use case, couldn't the custom trigger register an event time timer for MAX_WATERMARK, which would be triggered when the bounded input reaches its end? David On Mon, Jul 20, 2020 at 5:47 PM Piotr Nowojski wrote: > Hi, > > I'm afraid that there is not out of the box way of doing this.
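A hypothetical sketch of such a trigger: at end of a bounded input Flink emits Watermark.MAX_WATERMARK, which fires an event-time timer registered for Long.MAX_VALUE:

```java
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Fires exactly once, when the bounded input reaches its end.
public class EndOfInputTrigger extends Trigger<Object, GlobalWindow> {

    @Override
    public TriggerResult onElement(Object element, long ts,
            GlobalWindow window, TriggerContext ctx) {
        // re-registering the same timer is cheap and idempotent
        ctx.registerEventTimeTimer(Long.MAX_VALUE);
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        // fires only when the final MAX_WATERMARK arrives
        return time == Long.MAX_VALUE
                ? TriggerResult.FIRE_AND_PURGE : TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) {
        ctx.deleteEventTimeTimer(Long.MAX_VALUE);
    }
}
```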

Re: Count of records coming into the Flink App from KDS and at each step through Execution Graph

2020-07-25 Thread David Anderson
Setting up a Flink metrics dashboard in Grafana requires setting up and configuring one of Flink's metrics reporters [1] that is supported by Grafana as a data source. That means your options for a metrics reporter are Graphite, InfluxDB, Prometheus, or the Prometheus push reporter. If you want

Re: CEP use case ?

2020-07-17 Thread David Anderson
If the rules can be implemented by examining events in isolation (e.g., temperature > 25), then the DataStream API is all you need. But if you want rules that are looking for temporal patterns that play across multiple events, then CEP or MATCH_RECOGNIZE (part of Flink SQL) will simplify the

Re: How to write junit testcases for KeyedBroadCastProcess Function

2020-07-17 Thread David Anderson
You could approach testing this in the same way that Flink has implemented its unit tests for KeyedBroadcastProcessFunctions, which is to use a KeyedTwoInputStreamOperatorTestHarness with a CoBroadcastWithKeyedOperator. To learn how to use Flink's test harnesses, see [1], and for an example of

Re: Backpressure on Window step

2020-07-17 Thread David Anderson
Backpressure is typically caused by one of these things:
* problems relating to I/O to external services (e.g., enrichment via an API or database lookup, or a sink)
* data skew (e.g., a hot key)
* under-provisioning, or competition for resources
* spikes in traffic
* timer storms

Re: Is outputting from components other than sinks or side outputs a no-no ?

2020-07-26 Thread David Anderson
Every job is required to have a sink, but there's no requirement that all output be done via sinks. It's not uncommon, and doesn't have to cause problems, to have other operators that do I/O. What can be problematic, however, is doing blocking I/O. While your user function is blocked, the
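Flink addresses blocking I/O with its Async I/O operator (AsyncDataStream / AsyncFunction). The essential idea — never block the processing thread while a request is in flight — can be sketched with plain CompletableFuture (the `enrich` function here is hypothetical):

```java
import java.util.concurrent.CompletableFuture;

public class AsyncEnrichSketch {
    // Hypothetical non-blocking enrichment call: the caller gets a
    // future back immediately, so the processing thread never blocks
    // while the request is in flight.
    static CompletableFuture<String> enrich(String key) {
        return CompletableFuture.supplyAsync(() -> key + ":enriched");
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> result = enrich("user42");
        // In Flink's AsyncFunction you would complete a ResultFuture
        // from a callback; here we just wait for the demo's sake.
        System.out.println(result.get());
    }
}
```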

Re: Is there a way to use stream API with this program?

2020-07-28 Thread David Anderson
lly by the > WatermarkAssignerOperator. > > Piotrek > > Mon., 27 Jul 2020 at 09:16 Flavio Pompermaier > wrote: > >> Yes it could..where should I emit the MAX_WATERMARK and how do I detect >> that the input reached its end? >> >> On Sat,

Re: Removing stream in a job having state

2020-07-28 Thread David Anderson
When you modify a job by removing a stateful operator, then during a restart when Flink tries to restore the state, it will complain that the snapshot contains state that cannot be restored. The solution to this is to restart from the savepoint (or retained checkpoint), specifying that you want

Re: Is it possible to do state migration with checkpoints?

2020-07-23 Thread David Anderson
I believe this should work, with a couple of caveats: - You can't do this with unaligned checkpoints - If you have dropped some state, you must specify --allowNonRestoredState when you restart the job David On Wed, Jul 22, 2020 at 4:06 PM Sivaprasanna wrote: > Hi, > > We are trying out state
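For reference, restarting from a retained checkpoint while allowing state from dropped operators to be skipped looks roughly like this on the CLI (the checkpoint path and jar name are hypothetical):

```shell
# Resume from a retained checkpoint; skip state whose operator
# no longer exists in the modified job
flink run -s hdfs:///checkpoints/job-xyz/chk-42 \
    --allowNonRestoredState my-job.jar
```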

Re: Flink FsStatebackend is giving better performance than RocksDB

2020-07-18 Thread David Anderson
You should be able to tune your setup to avoid the OOM problem you have run into with RocksDB. It will grow to use all of the memory available to it, but shouldn't leak. Perhaps something is misconfigured. As for performance, with the FSStateBackend you can expect: * much better throughput and
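Switching between the two backends is a configuration change; a sketch in flink-conf.yaml (Flink 1.x style option keys; the checkpoint path is hypothetical):

```yaml
# Heap-based backend: fast, but state must fit in memory
state.backend: filesystem
state.checkpoints.dir: s3://my-bucket/checkpoints

# RocksDB alternative: supports larger-than-memory state and
# incremental checkpoints
# state.backend: rocksdb
# state.backend.incremental: true
```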

Re: Flink Sinks

2020-07-18 Thread David Anderson
Prasanna, The Flink project does not have an SQS connector, and a quick google search hasn't found one. Nor does Flink have an HTTP sink, but with a bit of googling you can find that various folks have implemented this themselves. As for implementing SQS as a custom sink, if you need exactly

Re: Backpressure on Window step

2020-07-18 Thread David Anderson
; between 136kb and 2.45mb of state, with a checkpoint duration of 280ms to 2 > seconds. Each of the 32 subtasks appear to be handling 1,700-50,000 records > an hour with a bytes received of 7mb and 170mb > > > > Am I barking up the wrong tree? > > > > -Steve > >

Re: Print on screen DataStream content

2020-11-24 Thread David Anderson
When Flink is running on a cluster, `DataStream#print()` prints to files in the log directory. Regards, David On Tue, Nov 24, 2020 at 6:03 AM Pankaj Chand wrote: > Please correct me if I am wrong. `DataStream#print()` only prints to the > screen when running from the IDE, but does not work

Re: taskmanager.cpu.cores 1.7976931348623157E308

2020-12-05 Thread David Anderson
taskmanager.cpu.cores is intended for internal use only -- you aren't meant to set this option. What happens if you leave it alone? Regards, David On Sat, Dec 5, 2020 at 8:04 AM Rex Fenley wrote: > We're running this in a local environment so that may be contributing to > what we're seeing. >

Re: Print on screen DataStream content

2020-11-24 Thread David Anderson
`print()` but I don't quite understand how to > > implement it. Could you please give me an example? I'm using Intellij so > > what I would need is just to see the data on my screen. > > > > Thanks > > > > > >

Re: Disk usage during savepoints

2020-12-12 Thread David Anderson
RocksDB does do compaction in the background, and incremental checkpoints simply mirror to S3 the set of RocksDB SST files needed by the current set of checkpoints. However, unlike checkpoints, which can be incremental, savepoints are always full snapshots. As for why one host would have much
