Re: Flink Table Kinesis sink not failing when sink fails

2022-11-30 Thread Dan Hill
5 it can via > 'sink.fail-on-error: true' > > Thanks > > On Wed, Nov 30, 2022 at 5:41 AM Dan Hill wrote: > >> My text logs don't have a stack trace with this exception. I'm doing >> this inside Flink SQL with a standard Kinesis connector and JSON formatter. >> >
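The option referenced in this reply ('sink.fail-on-error') is set in the sink table's DDL. A minimal sketch, assuming the Flink 1.14 Kinesis table connector with the JSON format; the table name, stream name, region, and columns are hypothetical:

```sql
CREATE TABLE kinesis_sink (
  platform_id BIGINT,
  content_id  STRING
) WITH (
  'connector' = 'kinesis',
  'stream' = 'my-output-stream',   -- hypothetical stream name
  'aws.region' = 'us-east-1',
  'format' = 'json',
  -- Without this, PutRecords failures (e.g. missing IAM permissions)
  -- are only logged and the job keeps checkpointing as if healthy.
  'sink.fail-on-error' = 'true'
);
```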

Re: Flink Table Kinesis sink not failing when sink fails

2022-11-29 Thread Dan Hill
en log the error message, in which case the Flink job won't > fail since it seems like no exception happens. > > Best regards, > Yuxia > > ------ > *From: *"Dan Hill" > *To: *"User" > *Sent: *Wednesday, November 30, 2022, 8:06:52 AM > *Subject: *Flink

Flink Table Kinesis sink not failing when sink fails

2022-11-29 Thread Dan Hill
I set up a simple Flink SQL job (Flink v1.14.4) that writes to Kinesis. The job looks healthy but the records are not being written. I did not give enough IAM permissions to write to Kinesis. However, the Flink SQL job acts like it's healthy and checkpoints even though the Kinesis PutRecords

Re: Example Flink SQL fromDataStream watermark code not showing *rowtime*

2022-11-27 Thread Dan Hill
I figured this out. I get this behavior because I was running the code in a minicluster test that defaulted to batch. I switched to streaming and it renders "*ROWTIME*". On Fri, Nov 25, 2022 at 11:51 PM Dan Hill wrote: > Hi. I copied the Flink code from this page. My print

Example Flink SQL fromDataStream watermark code not showing *rowtime*

2022-11-25 Thread Dan Hill
Hi. I copied the Flink code from this page. My printSchema() does not contain **ROWTIME** in the output. I'm running Flink v1.14.4. https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/data_stream_api/ public static class User {...} DataStream dataStream =
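A sketch of the pattern from the linked docs page; per the follow-up in this thread, the schema only prints *ROWTIME* when the environment runs in streaming mode (minicluster tests may default to batch). This assumes the Flink 1.14 Table API and the `User` class / `dataStream` from the docs example:

```java
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Minicluster tests may default to batch; rowtime only renders in streaming.
env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

Table table = tableEnv.fromDataStream(
    dataStream,  // DataStream<User> as in the docs example
    Schema.newBuilder()
        .columnByMetadata("rowtime", "TIMESTAMP_LTZ(3)")
        .watermark("rowtime", "SOURCE_WATERMARK()")
        .build());
// In streaming mode, the schema should show `rowtime` as *ROWTIME*.
table.printSchema();
```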

Re: Weird Flink SQL error

2022-11-24 Thread Dan Hill
etl/test_content_metrics', >'format' = 'json', > ) > > > Best, > Leonard > > > On Nov 25, 2022, at 11:20 AM, Dan Hill wrote: > > Also, if I try to do an aggregate inside the ROW, I get an error. I don't > get the error if it's not wrapped in a Row. > > ROW(


Re: Weird Flink SQL error

2022-11-24 Thread Dan Hill
rImpl.java:40981) org.apache.flink.sql.parser.impl.FlinkSqlParserImpl.jj_consume_token(FlinkSqlParserImpl.java:40792) org.apache.flink.sql.parser.impl.FlinkSqlParserImpl.SqlStmtEof(FlinkSqlParserImpl.java:3981) org.apache.flink.sql.parser.impl.FlinkSqlParserImpl.parseSqlStmtEof(FlinkSqlPa

Re: Weird Flink SQL error

2022-11-24 Thread Dan Hill
org.apache.flink.sql.parser.impl.FlinkSqlParserImpl.ParenthesizedSimpleIdentifierList(FlinkSqlParserImpl.java:25220) org.apache.flink.sql.parser.impl.FlinkSqlParserImpl.Expression3(FlinkSqlParserImpl.java:19925) org.apache.flink.sql.parser.impl.FlinkSqlParserImpl.Express

Re: Weird Flink SQL error

2022-11-23 Thread Dan Hill
If I remove the "TEMPORARY VIEW" and just inline the SQL, this works fine. This seems like a bug with temporary views. On Wed, Nov 23, 2022 at 1:38 PM Dan Hill wrote: > Looks related to this issue. > https://lists.apache.org/thread/1sb5bos6tjv39fh0wjkvmvht0824r4my > >

Re: Weird Flink SQL error

2022-11-23 Thread Dan Hill
DATE_FORMAT(TUMBLE_ROWTIME(rowtime, INTERVAL '1' DAY), '-MM-dd'), ROW( platform_id, content_id ) FROM content_event GROUP BY platform_id, content_id, TUMBLE(rowtime, INTERVAL '1' DAY) SELECT * FROM test_content_metrics_view On Wed, Nov 23, 2022 at 1:19 PM Dan Hill

Re: Weird Flink SQL error

2022-11-23 Thread Dan Hill
I upgraded to Flink v1.16.0 and I get the same error. On Wed, Nov 23, 2022 at 9:47 AM Dan Hill wrote: > For the error `Encountered "." at line 1, column 119.`, here are the > confusing parts: > > 1. The error happens when I executed the last part of the

Re: Weird Flink SQL error

2022-11-23 Thread Dan Hill
atement. 3. None of the SQL that I've written has a period "." in it. On Wed, Nov 23, 2022 at 8:32 AM Dan Hill wrote: > I'm using Flink 1.14.4 > > On Wed, Nov 23, 2022, 02:28 yuxia wrote: > >> Hi, Dan. >> I'm wondering what type of error you expect. IMO, I think m

Re: Weird Flink SQL error

2022-11-23 Thread Dan Hill
----- > *From: *"Dan Hill" > *To: *"User" > *Sent: *Wednesday, November 23, 2022, 1:55:20 PM > *Subject: *Weird Flink SQL error > > Hi. I'm hitting an obfuscated Flink SQL parser error. Is there a way to > get better errors for Flink SQL? I'm hitting

Weird Flink SQL error

2022-11-22 Thread Dan Hill
Hi. I'm hitting an obfuscated Flink SQL parser error. Is there a way to get better errors for Flink SQL? I'm hitting it when I wrap some of the fields on an inner Row. *Works* CREATE TEMPORARY VIEW `test_content_metrics_view` AS SELECT DATE_FORMAT(TUMBLE_ROWTIME(rowtime, INTERVAL '1'
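Reconstructed from the snippets in this thread, the temporary view at issue looks roughly like the following. The date-format string appears truncated in the archive; 'yyyy-MM-dd' is assumed here, and the table/column names come from the thread:

```sql
CREATE TEMPORARY VIEW `test_content_metrics_view` AS
SELECT
  DATE_FORMAT(TUMBLE_ROWTIME(rowtime, INTERVAL '1' DAY), 'yyyy-MM-dd'),
  -- Wrapping fields in an inner ROW is what triggers the parser error;
  -- inlining the SELECT instead of using the view works around it.
  ROW(platform_id, content_id)
FROM content_event
GROUP BY platform_id, content_id, TUMBLE(rowtime, INTERVAL '1' DAY);
```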

Broadcast state and OutOfMemoryError: Direct buffer memory

2022-10-21 Thread Dan Hill
Hi. My team recently added broadcast state to our Flink jobs. We've started hitting this OOM "Direct buffer memory" error. Is this a common problem with broadcast state? Or is it likely a different problem? Thanks! - Dan

Re: RichAsyncFunction + Cache or Map State?

2022-05-09 Thread Dan Hill
Hi. Any advice on this? I just hit this too. Some ideas: 1. Manage our own separate cache (disk, Redis, etc). 2. Use two operators (first one a cache one and the second is the RichAsyncFunction). Have a feedback loop by using another Kafka topic or S3 File source/sink. On Wed, Feb 9, 2022 at

Re: Savepoint and cancel questions

2022-04-23 Thread Dan Hill
ode maybe what you want. > > On Sat, Apr 23, 2022 at 7:38 AM Dan Hill wrote: > >> Hi. >> >> 1. Why isn’t –externalizedCheckpointCleanup an option on savepoint >> (instead of being needed at the start of a job run)? >> >> 2. Can we get a confirmatio

Savepoint and cancel questions

2022-04-22 Thread Dan Hill
Hi. 1. Why isn’t –externalizedCheckpointCleanup an option on savepoint (instead of being needed at the start of a job run)? 2. Can we get a confirmation dialog when someone clicks "cancel job" in the UI? Just in case people click on accident. 3. Can we get a way to have Flink clean up the

Kubernetes killing TaskManager - Flink ignoring taskmanager.memory.process.size

2022-04-20 Thread Dan Hill
Hi. I upgraded to Flink v1.14.4 and now my Flink TaskManagers are being killed by Kubernetes for exceeding the requested memory. My Flink TM is using an extra ~5gb of memory over the tm.memory.process.size. Here are the flink-config values that I'm using taskmanager.memory.process.size:
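For TaskManagers killed by Kubernetes, the container limit has to cover the whole Flink process memory model, not just the heap. A hedged flink-conf.yaml sketch (the 8g value is hypothetical; the keys are standard Flink options):

```yaml
# Total process memory: heap, managed (RocksDB), network buffers,
# metaspace, and JVM overhead are all carved out of this value.
# The Kubernetes container memory limit should match it.
taskmanager.memory.process.size: 8g
# Keep RocksDB inside Flink's managed memory budget so native
# allocations cannot grow past the process size.
state.backend.rocksdb.memory.managed: true
```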

Re: Weird Flink Kafka source watermark behavior

2022-03-18 Thread Dan Hill
roblem still observed even with # sour tasks = # partitions? > > For committers: > Is there a way to confirm that per-partition watermarking is used in TM > log? > > On Fri, Mar 18, 2022 at 4:14 PM Dan Hill wrote: > >> I hit this using event processing and no idleness detection.

Re: Weird Flink Kafka source watermark behavior

2022-03-18 Thread Dan Hill
u increase the value to 1 minute, your > backfill job should catch up real-time within 1 minute. > > Best, > > Dongwon > > > On Fri, Mar 18, 2022 at 3:51 PM Dan Hill wrote: > >> Thanks Dongwon! >> >> Wow. Yes, I'm using per-partition watermarking [1]. Ye

Re: Weird Flink Kafka source watermark behavior

2022-03-18 Thread Dan Hill
ster/docs/dev/datastream/event-time/generating_watermarks/#dealing-with-idle-sources > > Best, > > Dongwon > > On Fri, Mar 18, 2022 at 2:35 PM Dan Hill wrote: > >> I'm following the example from this section: >> >> https://nightlies.apache.org/flink/flink-docs-mas

Clarifying ProcessFunction.onTimer and watermark behavior

2022-03-17 Thread Dan Hill
Hi. This Flink page says the following: “With event-time timers, the onTimer(...) method is called when the current watermark is advanced up to or beyond the timestamp of the timer” The

Re: Weird Flink Kafka source watermark behavior

2022-03-17 Thread Dan Hill
I'm following the example from this section: https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/event-time/generating_watermarks/#watermark-strategies-and-the-kafka-connector On Thu, Mar 17, 2022 at 10:26 PM Dan Hill wrote: > Other points > - I'm using the kafka tim

Re: Weird Flink Kafka source watermark behavior

2022-03-17 Thread Dan Hill
Other points - I'm using the kafka timestamp as event time. - The same issue happens even if I use an idle watermark. On Thu, Mar 17, 2022 at 10:17 PM Dan Hill wrote: > There are 12 Kafka partitions (to keep the structure similar to other low > traffic environments). > > On Thu,
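The setup described in this thread (Kafka timestamp as event time, per-partition watermarking, idleness for mostly-empty partitions) can be sketched as follows. This assumes the Flink 1.14 `KafkaSource` API; `Event`, the deserializer, and the duration values are hypothetical:

```java
KafkaSource<Event> source = KafkaSource.<Event>builder()
    .setBootstrapServers("broker:9092")
    .setTopics("events")
    .setValueOnlyDeserializer(new EventDeserializer())
    .build();

WatermarkStrategy<Event> strategy = WatermarkStrategy
    // per-partition watermarks: the source watermark is the minimum
    // across all partitions that are producing records
    .<Event>forBoundedOutOfOrderness(Duration.ofSeconds(20))
    // use the Kafka record timestamp as event time
    .withTimestampAssigner((event, kafkaTimestamp) -> kafkaTimestamp)
    // without idleness, empty partitions hold the watermark back
    .withIdleness(Duration.ofMinutes(1));

env.fromSource(source, strategy, "kafka-events");
```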

Re: Weird Flink Kafka source watermark behavior

2022-03-17 Thread Dan Hill
There are 12 Kafka partitions (to keep the structure similar to other low traffic environments). On Thu, Mar 17, 2022 at 10:13 PM Dan Hill wrote: > Hi. > > I'm running a backfill from a kafka topic with very few records spread > across a few days. I'm seeing a case where the re

Weird Flink Kafka source watermark behavior

2022-03-17 Thread Dan Hill
Hi. I'm running a backfill from a kafka topic with very few records spread across a few days. I'm seeing a case where the records coming from a kafka source have a watermark that's more recent (by hours) than the event time. I haven't seen this before when running this. This violates what I'd

Re: Flink UI - Operator Chaining - broken with "Records Sent"

2022-03-17 Thread Dan Hill
This is on Flink v1.12.3. On Thu, Mar 17, 2022 at 3:16 PM Dan Hill wrote: > Hi. I have an operator that Flink chained together with some side output > operators. Even though the main output of the operator goes to another > operator, the "Records Sent" metric is stil

Flink UI - Operator Chaining - broken with "Records Sent"

2022-03-17 Thread Dan Hill
Hi. I have an operator that Flink chained together with some side output operators. Even though the main output of the operator goes to another operator, the "Records Sent" metric is still zero. I'd expect it to be the number of records of the main output (not the side sink operators). Is this

Re: Is State TTL possible with event-time characteristics ?

2022-01-16 Thread Dan Hill
he.org/flink/flink-docs-release-1.14/docs/dev/datastream/fault-tolerance/state/#state-time-to-live-ttl > [4] https://flink.apache.org/2019/05/19/state-ttl.html > > Best regards, > Jinzhong, Li > > > On 2022/01/10 04:39:43 Dan Hill wrote: > > Hi. Any updates on this? > >

Re: Is State TTL possible with event-time characteristics ?

2022-01-09 Thread Dan Hill
Hi. Any updates on this? How do people usually do TTLs? I want to add a backup TTL in case I have leaky state. On Wed, Jun 17, 2020 at 6:08 AM Andrey Zagrebin wrote: > Hi Arti, > > Any program can use State with TTL but the state can only expire in > processing time at the moment even if you
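A "backup TTL for leaky state" can be configured per state descriptor; as the quoted reply notes, expiry currently happens in processing time. A sketch assuming the RocksDB backend; the 7-day TTL and state names are hypothetical:

```java
StateTtlConfig ttl = StateTtlConfig
    .newBuilder(Time.days(7))  // backup TTL in case state leaks
    .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
    .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
    // expire entries in the background during RocksDB compaction
    .cleanupInRocksdbCompactFilter(1000)
    .build();

MapStateDescriptor<String, Long> desc =
    new MapStateDescriptor<>("joinState", String.class, Long.class);
desc.enableTimeToLive(ttl);
```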

Re: Kryo EOFException: No more bytes left

2021-12-21 Thread Dan Hill
I was not able to reproduce it by re-running the same job with an updated kryo library. The join doesn't do anything special. On Sun, Dec 19, 2021 at 4:58 PM Dan Hill wrote: > I'll retry the job to see if it's reproducible. The serialized state is > bad so that run keeps failing. >

Re: Kryo EOFException: No more bytes left

2021-12-19 Thread Dan Hill
I'll retry the job to see if it's reproducible. The serialized state is bad so that run keeps failing. On Sun, Dec 19, 2021 at 4:28 PM Zhipeng Zhang wrote: > Hi Dan, > > Could you provide the code snippet such that we can reproduce the bug here? > > Dan Hill wrote on Mon, Dec 20, 2021, 0

Kryo EOFException: No more bytes left

2021-12-19 Thread Dan Hill
Hi. I was curious if anyone else has hit this exception. I'm using the IntervalJoinOperator to join two streams of protos. I registered the protos with a kryo serializer. I started hitting this issue, which looks like the operator is trying to deserialize a bad set of bytes that it serialized. I'm

Re: CoGroupedStreams and disableAutoGeneratedUIDs

2021-12-13 Thread Dan Hill
s and assign > > uids > > * However the two maps are stateless and technically don’t need a uid > > > > What do you think? > > > > Thias > > > > *From:*Dan Hill > > *Sent:* Monday, December 13, 2021 06:30 > > *To:* user > > *S

CoGroupedStreams and disableAutoGeneratedUIDs

2021-12-12 Thread Dan Hill
Hi. I tried to use CoGroupedStreams w/ disableAutoGeneratedUIDs. CoGroupedStreams creates two map operators without the ability to set uids on them. These appear as "Map" in my operator graph. I noticed that the CoGroupedStreams.apply function has two map calls without setting uids. If I try

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
s ID to your code. > > On Wed, Dec 8, 2021 at 7:52 AM Dan Hill wrote: > >> Nothing changed (as far as I know). It's the same binary and the same >> args. It's Flink v1.12.3. I'm going to switch away from auto-gen uids and >> see if that helps. The job randomly started

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
> Somehow something must have changed in your job: Did you change the Flink > version? > > Hope this helps! > > On Wed, Dec 8, 2021 at 5:49 AM Dan Hill wrote: > >> I'm restoring the job with the same binary and same flags/args. >> >> On Tue, Dec 7, 2021 at 8:48

Re: Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
I'm restoring the job with the same binary and same flags/args. On Tue, Dec 7, 2021 at 8:48 PM Dan Hill wrote: > Hi. I noticed this warning has "operator > 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an > operator uid or name. It looks like so

Weird operator ID check restore from checkpoint fails

2021-12-07 Thread Dan Hill
Hi. I noticed this warning has "operator 811d3b279c8b26ed99ff0883b7630242" in it. I assume this should be an operator uid or name. It looks like something else. What is it? Is something corrupted? org.apache.flink.runtime.client.JobInitializationException: Could not instantiate JobManager.

Does Flink ever delete any sink S3 files?

2021-11-30 Thread Dan Hill
Hi. I'm debugging an issue with a system that listens for files written to S3. I'm assuming Flink does not modify sink objects after they've been written. I've seen a minicluster test locally write a ".part-" file. I wanted to double check to make sure S3 sinks won't write and delete files.

Is there a way to print key and state metadata/types for a job?

2021-11-25 Thread Dan Hill
I'm trying to track down a couple errors I've hit related to key groups. I want to verify that all of my keys have stable hashes. I tried to print out the execution plan but it doesn't contain enough info.

Re: s3.entropy working locally but not in production

2021-11-13 Thread Dan Hill
ller > than state.storage.fs.memory-threshold is written with the checkpoint > metadata) won't have entropy injected. > > David > > On Sat, Nov 13, 2021 at 2:33 AM Dan Hill wrote: > >> It seems to work for some jobs and not for others. Maybe jobs with >> little or empt

Re: s3.entropy working locally but not in production

2021-11-12 Thread Dan Hill
It seems to work for some jobs and not for others. Maybe jobs with little or empty state don't have _entropy_ swapped out correctly? On Fri, Nov 12, 2021 at 5:31 PM Dan Hill wrote: > Hi. My config I was able to verify my configs work locally with using > minio. When I have the sam

s3.entropy working locally but not in production

2021-11-12 Thread Dan Hill
Hi. I was able to verify my configs work locally using minio. When I have the same code deployed to prod, the entropy key is not replaced. Any ideas? My logs are showing the correct entropy key. Configs: state.checkpoints.dir: s3a://my-flink-state/_entropy_/checkpoints
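For reference, entropy injection is controlled by the S3 filesystem options below; this sketch uses the checkpoint path from this thread, and the reply in the follow-up notes that small state inlined into the checkpoint metadata (below state.storage.fs.memory-threshold) does not get entropy injected:

```yaml
# _entropy_ is replaced with random characters in checkpoint data paths;
# metadata files and inlined small state keep the literal key.
state.checkpoints.dir: s3a://my-flink-state/_entropy_/checkpoints
s3.entropy.key: _entropy_
s3.entropy.length: 4
```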

Re: Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-24 Thread Dan Hill
[2] > https://github.com/apache/flink/blob/master/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/MultipleInputStreamOperator.java > > > Dan Hill wrote on Fri, Oct 22, 2021 at 1:34 PM: > >> Probably worth restating. I was hoping to avoid a lot of shuff

Re: Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-21 Thread Dan Hill
task managers depending on the operator (even if parallelism is the same)? It's also weird that the output looks fine as-is but fails when rescaling partitions. On Thu, Oct 21, 2021 at 10:23 PM Dan Hill wrote: > I found the following link about this. Still looks applicable. In my > case, I

Re: Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-21 Thread Dan Hill
I found the following link about this. Still looks applicable. In my case, I don't need to do a broadcast join. https://www.alibabacloud.com/blog/flink-how-to-optimize-sql-performance-using-multiple-input-operators_597839 On Thu, Oct 21, 2021 at 9:51 PM Dan Hill wrote: > Interesting. Tha

Re: Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-21 Thread Dan Hill
reams(env) >>.transform(transform) >>.addSink(resultSink); >> >> > I would invite @Piotr to double check this conclusion. He is more > professional on this topic. > > @Piotr, Would you please check Dan's question? Please correct me if I'm > wrong. >

Re: Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-15 Thread Dan Hill
re getting closer to a solution >> >> >> >> >> >> Thias >> >> >> >> >> >> >> >> *From:* Schwalbe Matthias >> *Sent:* Friday, October 15, 2021 08:49 >> *To:* 'Dan Hill' ; user >> *Subjec

Re: Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-14 Thread Dan Hill
ns" ? > Do you change max_parallellism or Partitioner strategy? > Besides, does this problem always happen, or does it happen occasionally > when you restore from the savepoint? > Would you please provide the code to reproduce the code? > > Best, > JING ZHANG > > D

Any issues with reinterpretAsKeyedStream when scaling partitions?

2021-10-14 Thread Dan Hill
I have a job that uses reinterpretAsKeyedStream across a simple map to avoid a shuffle. When changing the number of partitions, I'm hitting an issue with registerEventTimeTimer complaining that "key group from 110 to 119 does not contain 186". I'm using Flink v1.12.3. Any thoughts on this? I
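The "key group from 110 to 119 does not contain 186" error comes from Flink's key-group assignment: a key is hashed into one of maxParallelism key groups, and each subtask owns a contiguous range of groups. With reinterpretAsKeyedStream, Flink trusts that records already sit on the subtask owning their group, so changing parallelism breaks that assumption. A simplified plain-Java sketch — Flink additionally applies a murmur hash to key.hashCode(), which is omitted here, and the names are illustrative:

```java
public class KeyGroupSketch {
    // Simplified stand-in for Flink's KeyGroupRangeAssignment;
    // Flink murmur-hashes key.hashCode() before taking the modulo.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Which operator subtask owns a given key group.
    static int operatorIndexFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    public static void main(String[] args) {
        int maxParallelism = 128;  // Flink's default for small jobs
        String key = "user-42";    // hypothetical key
        int group = keyGroupFor(key, maxParallelism);
        // After rescaling, the owning subtask changes, so a record that was
        // physically routed under the old parallelism can land on a subtask
        // whose key-group range does not contain its group -- which is what
        // registerEventTimeTimer then complains about.
        System.out.println("key group: " + group
            + ", owner at p=4: " + operatorIndexFor(group, maxParallelism, 4)
            + ", owner at p=8: " + operatorIndexFor(group, maxParallelism, 8));
    }
}
```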

Re: Helper methods for catching unexpected key changes?

2021-10-08 Thread Dan Hill
> I assume your exception happens in keyedProcessFunction2? > > > > reinterpretAsKeyedStream makes sense if you want to chain > keyedProcessFunction1 and keyedProcessFunction2, otherwise keyBy() will do … > > > > I hope these hints help, otherwise feel free to get back

Helper methods for catching unexpected key changes?

2021-10-07 Thread Dan Hill
Hi. I'm getting the following errors when using reinterpretAsKeyedStream. I don't expect the key to change for rows in reinterpretAsKeyedStream. Are there any utilities that I can use with reinterpretAsKeyedStream to verify that the key doesn't change? E.g. some wrapper operator?

Re: Questions about keyed streams

2021-09-28 Thread Dan Hill
So I'd probably give that a try > first and convert the Table to DataStream where needed. > > On Sat, Jul 24, 2021 at 9:22 PM Dan Hill wrote: > >> Thanks Fabian and Senhong! >> >> Here's an example diagram of the join that I want to do. There are more >> laye

Re: byte array as keys in Flink

2021-09-24 Thread Dan Hill
keep us posted if that works > > > Thias > > > > > > *From:* Guowei Ma > *Sent:* Friday, September 24, 2021 09:34 > *To:* Caizhi Weng > *Cc:* Dan Hill ; user > *Subject:* Re: byte array as keys in Flink > > > > Hi Hill > > > > As far

byte array as keys in Flink

2021-09-23 Thread Dan Hill
*Context* I want to perform joins based on UUIDs. The String version is less efficient so I figured I should use the byte[] version. I did a shallow dive into the Flink code and I'm not sure it's safe to use byte[] as a key (since it uses object equals/hashcode). *Request* How do other Flink devs do
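The concern above is well-founded: Java arrays inherit identity-based equals/hashCode from Object, so two byte[] with the same UUID bytes hash differently. A self-contained plain-Java sketch of the problem and one possible fix, a small wrapper with content-based equality (the wrapper name is made up for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class ByteKeyDemo {
    // Minimal wrapper giving byte[] value-based equals/hashCode, so two
    // UUIDs with the same bytes hash to the same key.
    static final class BytesKey {
        private final byte[] bytes;
        BytesKey(byte[] bytes) { this.bytes = bytes; }
        @Override public boolean equals(Object o) {
            return (o instanceof BytesKey) && Arrays.equals(bytes, ((BytesKey) o).bytes);
        }
        @Override public int hashCode() { return Arrays.hashCode(bytes); }
    }

    public static void main(String[] args) {
        byte[] a = {1, 2, 3};
        byte[] b = {1, 2, 3};

        // byte[] uses identity hashCode/equals: same contents, lookup misses.
        Map<byte[], String> raw = new HashMap<>();
        raw.put(a, "x");
        System.out.println(raw.containsKey(b)); // false

        // The wrapper hashes by content, so the lookup behaves as expected.
        Map<BytesKey, String> wrapped = new HashMap<>();
        wrapped.put(new BytesKey(a), "x");
        System.out.println(wrapped.containsKey(new BytesKey(b))); // true
    }
}
```

The same reasoning applies to keyBy: an unstable or identity-based hash makes key-group assignment non-deterministic across restarts.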

Observability tools on top of Flink

2021-09-22 Thread Dan Hill
Hi! I saw a recent Medium article

Invalid flink-config keeps going and ignores bad config values

2021-09-17 Thread Dan Hill
Hi. I noticed my flink-config.yaml had an error in it. I assumed a bad config would stop Flink from running (to catch errors earlier). Is there a way I can enable a strict parsing mode so any Flink parsing issue causes Flink to fail? I don't see one when looking at the code. 2021-09-17

Re: Migrating Kafka Sources (major version change)

2021-07-29 Thread Dan Hill
Are there any docs that talk about the new idleness support? I want to understand it better. https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/ On Thu, Jul 29, 2021 at 6:15 PM Dan Hill wrote: > Thanks, JING and Arvid! > > Interesting. Th

Re: Migrating Kafka Sources (major version change)

2021-07-29 Thread Dan Hill
w topic from >> earliest_offset because new topic name is different the previous one, so >> those KafkaTopicPartition could not be found in restored state. >> And restored state would be overwritten with new Kafka topic and offsets >> after a checkpoint. >

Migrating Kafka Sources (major version change)

2021-07-24 Thread Dan Hill
Hi! *Scenario* I want to eventually do a breaking change to a Kafka source (major version change) which requires a new Kafka topic. *Question* What utilities exist to help with this in Flink? What best practices exist? My plan is roughly the following: 1. Change my Flink job to support both

Re: Questions about keyed streams

2021-07-24 Thread Dan Hill
ore information. > > Best, > Senhong > > Sent with a Spark <https://sparkmailapp.com/source?from=signature> > On Jul 22, 2021, 7:33 AM +0800, Dan Hill , wrote: > > Hi. > > 1) If I use the same key in downstream operators (my key is a user id), > will the ro

Questions about keyed streams

2021-07-21 Thread Dan Hill
Hi. 1) If I use the same key in downstream operators (my key is a user id), will the rows stay on the same TaskManager machine? I join in more info based on the user id as the key. I'd like for these to stay on the same machine rather than shuffle a bunch of user-specific info to multiple task

Re: Kafka data sources, multiple interval joins and backfilling

2021-07-21 Thread Dan Hill
e previous join > finished > > BTW: Please double check you use an interval join instead of a regular join; > this would happen if you compare two fields with a regular timestamp type in the join > condition instead of a time attribute. > > Best, > JING ZHANG > > Dan Hill wrote on Jul 21, 2021

Kafka data sources, multiple interval joins and backfilling

2021-07-20 Thread Dan Hill
Hi. My team's flink job has cascading interval joins. The problem I'm outlining below is fine when streaming normally. It's an issue with backfills. We've been running into a bunch of backfills to evaluate the job over older data. When running as backfills, I've noticed that sometimes one of

Re: Watermark UI after checkpoint failure

2021-07-20 Thread Dan Hill
Wysakowicz wrote: > Do you mean a failed checkpoint, or do you mean that it happens after a > restore from a checkpoint? If it is the latter then this is kind of > expected, as watermarks are not checkpointed and they need to be > repopulated again. > > Best, > > Dawid > >

Watermark UI after checkpoint failure

2021-07-18 Thread Dan Hill
After my dev flink job hits a checkpoint failure (e.g. timeout) and then has successful checkpoints, the flink job appears to be in a bad state. E.g. some of the operators that previously had a watermark start showing "no watermark". The jobs proceed very slowly. Is there documentation for this

Re: savepoint failure

2021-07-13 Thread Dan Hill
Could this be caused by mixing of configuration settings when running? Running a job with one parallelism, stop/savepointing and then recovering with a different parallelism? I'd assume that's fine and wouldn't create bad state. On Tue, Jul 13, 2021 at 12:34 PM Dan Hill wrote: > I chec

Re: savepoint failure

2021-07-13 Thread Dan Hill
I checked my code. Our keys for streams and map state only use either (1) string, (2) long IDs that don't change or (3) Tuple of 1 and 2. I don't know why my current case is breaking. Our job partitions and parallelism settings have not changed. On Tue, Jul 13, 2021 at 12:11 PM Dan Hill

Re: savepoint failure

2021-07-13 Thread Dan Hill
Hey. I just hit a similar error in production when trying to savepoint. We also use protobufs. Has anyone found a better fix to this? On Fri, Oct 23, 2020 at 5:21 AM Till Rohrmann wrote: > Glad to hear that you solved your problem. Afaik Flink should not read the > fields of messages and call

Advancing watermark with low-traffic Kafka partitions

2021-06-19 Thread Dan Hill
Hi. I remember listening to Flink Forward talks about strategies for handling low-traffic Kafka partitions. One of the talks showed a technique for sending events to advance the watermarks through most levels of the event flow (Kafka source through the sinks). One of the talks mentioned there

Re: Diagnosing bottlenecks in Flink jobs

2021-06-17 Thread Dan Hill
set multiple > parallelism for the downstream operators. But this way would introduce > extra cpu cost for serialize/deserialize and extra network cost for shuffle > data. I'm not sure the benefits of this method can offset the additional > costs. > > Best, > JING ZHANG >

Re: Diagnosing bottlenecks in Flink jobs

2021-06-16 Thread Dan Hill
ind > something abnormal about the job. > > Best, > JING ZHANG > > Dan Hill wrote on Thu, Jun 17, 2021 at 12:44 PM: > >> We have a job that has been running but none of the AWS resource metrics >> for the EKS, EC2, MSK and EBS show any bottlenecks. I have multiple 8 >> cores allo

Diagnosing bottlenecks in Flink jobs

2021-06-16 Thread Dan Hill
We have a job that has been running but none of the AWS resource metrics for the EKS, EC2, MSK and EBS show any bottlenecks. I have multiple 8 cores allocated but only ~2 cores are used. Most of the memory is not consumed. MSK does not show much use. EBS metrics look mostly idle. I assumed

Re: Checkpoint is timing out - inspecting state

2021-06-16 Thread Dan Hill
/docs/ops/monitoring/checkpoint_monitoring/ > > ------Original Mail -- > *Sender:*Dan Hill > *Send Date:*Sat Jun 12 09:15:50 2021 > *Recipients:*user > *Subject:*Checkpoint is timing out - inspecting state > >> Hi. >> >> We're doing som

Checkpoint is timing out - inspecting state

2021-06-11 Thread Dan Hill
Hi. We're doing something bad with our Flink state. We just launched a feature that creates very big values (lists of objects that we append to) in MapState. Our checkpoints time out (10 minutes). I'm assuming the values are too big. Backpressure is okay and cpu+memory metrics look okay.

Re: Does WatermarkStrategy.withIdleness work?

2021-06-01 Thread Dan Hill
events have been emitted. Since we're just using this for local development, it's fine. On Fri, Mar 12, 2021 at 1:55 AM Dan Hill wrote: > Thanks David! > > On Fri, Mar 12, 2021, 01:54 David Anderson wrote: > >> WatermarkStrategy.withIdleness works by marking idle s

Re: Flink missing Kafka records

2021-04-29 Thread Dan Hill
a previously idle partition with an event timestamp before the > watermark of the other partitions, that record would be deemed late and is > discarded. > > On Tue, Apr 27, 2021 at 2:42 AM Dan Hill wrote: > >> Hey Robert. >> >> Nothing weird. I was trying to find recent records

Re: Checkpoint error - "The job has failed"

2021-04-28 Thread Dan Hill
1.11.1. > > [1] https://issues.apache.org/jira/browse/FLINK-16753 > > Best > Yun Tang > ------ > *From:* Dan Hill > *Sent:* Tuesday, April 27, 2021 7:50 > *To:* Yun Tang > *Cc:* Robert Metzger ; user > *Subject:* Re: Checkpoint error - "The job has

Re: Flink missing Kafka records

2021-04-26 Thread Dan Hill
"the first records" or the "latest records" > missing? Any individual records missing, or larger blocks of data? > > I don't think that there's a bug in Flink or the Kafka connector. Maybe > its just a configuration or systems design issue. > > > On Sun, Apr 2

Re: Checkpoint error - "The job has failed"

2021-04-26 Thread Dan Hill
.3. > > > [1] https://issues.apache.org/jira/browse/FLINK-16753 > > Best > Yun Tang > -- > *From:* Robert Metzger > *Sent:* Monday, April 26, 2021 14:46 > *To:* Dan Hill > *Cc:* user > *Subject:* Re: Checkpoint error - "T

Checkpoint error - "The job has failed"

2021-04-25 Thread Dan Hill
My Flink job failed to checkpoint with a "The job has failed" error. The logs contained no other recent errors. I keep hitting the error even if I cancel the jobs and restart them. When I restarted my jobmanager and taskmanager, the error went away. What error am I hitting? It looks like

Flink missing Kafka records

2021-04-25 Thread Dan Hill
Hi! Have any other devs noticed issues with Flink missing Kafka records with long-running Flink jobs? When I re-run my Flink job and start from the earliest Kafka offset, Flink processes the events correctly. I'm using Flink v1.11.1. I have a simple job that takes records (Requests) from Kafka

Re: Does WatermarkStrategy.withIdleness work?

2021-03-12 Thread Dan Hill
2] > https://github.com/aljoscha/flink/blob/6e4419e550caa0e5b162bc0d2ccc43f6b0b3860f/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/timestamps/ProcessingTimeTrailingBoundedOutOfOrdernessTimestampExtractor.java > > On Fri, Mar 12, 2021 at 9:47 AM Dan Hill w

Does WatermarkStrategy.withIdleness work?

2021-03-12 Thread Dan Hill
I haven't been able to get WatermarkStrategy.withIdleness to work. Is it broken? None of my timers trigger when I'd expect idleness to take over. On Tue, Mar 2, 2021 at 11:15 PM Dan Hill wrote: > Hi. > > For local and tests development, I want to flush the events in my system >

Re: Gradually increasing checkpoint size

2021-03-12 Thread Dan Hill
hink it's where you pointed. > I'd suggest to first investigate the progress of watermarks. > > Best, > > Dawid > On 09/03/2021 08:36, Dan Hill wrote: > > Hi Yun! > > That advice was useful. The state for that operator is very small > (31kb). Most of the checkpoin

Re: Best practices for complex state manipulation

2021-03-10 Thread Dan Hill
ommend starting from this webinar > done by my colleague Seth Weismann: > https://www.youtube.com/watch?v=9GF8Hwqzwnk. > > Cheers, > Gordon > > On Wed, Mar 10, 2021 at 1:58 AM Dan Hill wrote: > >> Hi! >> >> I'm working on a join setup that does fuzzy match

Best practices for complex state manipulation

2021-03-09 Thread Dan Hill
Hi! I'm working on a join setup that does fuzzy matching in case the client does not send enough parameters to join by a foreign key. There's a few ways I can store the state. I'm curious about best practices around this. I'm using rocksdb as the state storage. I was reading the code for

Re: Re: Gradually increasing checkpoint size

2021-03-08 Thread Dan Hill
a `alreadyOutputed` value state, which seems to keep > increasing if there are always new keys ? > > Best, > Yun > > > ------Original Mail -- > *Sender:*Dan Hill > *Send Date:*Tue Mar 9 00:59:24 2021 > *Recipients:*Yun Gao > *CC

Re: Gradually increasing checkpoint size

2021-03-08 Thread Dan Hill
k. > > Best, > Yun > > > --Original Mail -- > *Sender:*Dan Hill > *Send Date:*Mon Mar 8 14:59:48 2021 > *Recipients:*user > *Subject:*Gradually increasing checkpoint size > >> Hi! >> >> I'm running a backfill Flink st

Gradually increasing checkpoint size

2021-03-07 Thread Dan Hill
Hi! I'm running a backfill Flink stream job over older data. It has multiple interval joins. I noticed my checkpoint is regularly gaining in size. I'd expect my checkpoints to stabilize and not grow. Is there a setting to prune useless data from the checkpoint? My top guess is that my

Re: Debugging long Flink checkpoint durations

2021-03-04 Thread Dan Hill
The checkpoint was only acknowledged shortly after it was started. On Thu, Mar 4, 2021 at 12:38 PM Dan Hill wrote: > I dove deeper into it and made a little more progress (by giving more > resources). > > Here is a screenshot of one bottleneck: > https://drive.go

Re: Debugging long Flink checkpoint durations

2021-03-04 Thread Dan Hill
/flink-docs-release-1.11/ops/memory/mem_setup_tm.html On Tue, Mar 2, 2021 at 3:45 PM Dan Hill wrote: > Thanks! Yes, I've looked at these. My job is facing backpressure > starting at an early join step. I'm unclear if more time is fine for the > backfill or if I need more

Flink, local development, finish processing a stream of Kafka data

2021-03-02 Thread Dan Hill
Hi. For local and tests development, I want to flush the events in my system to make sure I'm processing everything. My watermark does not progress to finish all of the data. What's the best practice for local development or tests? If I use idle sources for 1 Kafka partition, this appears

Re: Debugging long Flink checkpoint durations

2021-03-02 Thread Dan Hill
Thanks! Yes, I've looked at these. My job is facing backpressure starting at an early join step. I'm unclear if more time is fine for the backfill or if I need more resources. On Tue, Mar 2, 2021 at 12:50 AM Yun Gao wrote: > Hi Dan, > > I think you could see the detail of the checkpoints

Debugging long Flink checkpoint durations

2021-03-01 Thread Dan Hill
Hi. Are there good ways to debug long Flink checkpoint durations? I'm running a backfill job that runs ~10 days of data and then checkpoints start failing. Since I only see the last 10 checkpoints in the jobmaster UI, I don't see when it starts. I looked through the text logs and didn't see

Best practices around checkpoint intervals and sizes?

2021-02-17 Thread Dan Hill
Hi. I'm playing around with optimizing our checkpoint intervals and sizes. Are there any best practices around this? I have ~7 sequential joins and a few sinks. I'm curious what would result in better throughput and latency trade-offs. I'd assume less frequent checkpointing would
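The knobs being tuned here are standard Flink options; a hedged flink-conf.yaml sketch (the values are hypothetical starting points, not recommendations):

```yaml
execution.checkpointing.interval: 60s
# guarantee breathing room between checkpoints under backpressure
execution.checkpointing.min-pause: 30s
execution.checkpointing.timeout: 10min
# with the RocksDB backend, upload only changed SST files per checkpoint
state.backend.incremental: true
```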

Re: Optimizing Flink joins

2021-02-12 Thread Dan Hill
: > Hi Dan, > > thanks for letting us know. Could you give us some feedback what is > missing in SQL for this use case? Are you looking for some broadcast > joining or which kind of algorithm would help you? > > Regards, > Timo > > On 11.02.21 20:32, Dan Hill wrote:

Re: Optimizing Flink joins

2021-02-11 Thread Dan Hill
OM c, b, a").explain()); > > So you can reorder the tables in the query if that improves performance. > For interval joins, we currently don't provide additional algorithms or > options. > > Regards, > Timo > > On 11.02.21 05:04, Dan Hill wrote: > > Hi! I was curious if ther
