Re: Streaming join performance

2023-08-14 Thread Alexey Novakov via user
Hi Artem! Are your tables backed by Kafka? If yes, what if you use the upsert-kafka connector from the Table API? Does it help to reduce the number of records in each subsequent join operator? I wrote a blog-p
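
For context, a minimal sketch of what such an upsert-kafka table could look like in the Table API; the topic, broker address, and schema are placeholders, not taken from the thread:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class UpsertKafkaSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

            // upsert-kafka keeps only the latest row per key, so downstream joins see
            // fewer retractions than with a plain append-only kafka table.
            tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  order_id STRING," +
                "  amount   DECIMAL(10, 2)," +
                "  PRIMARY KEY (order_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'upsert-kafka'," +
                "  'topic' = 'orders'," +                              // placeholder topic
                "  'properties.bootstrap.servers' = 'broker:9092'," +  // placeholder broker
                "  'key.format' = 'json'," +
                "  'value.format' = 'json'" +
                ")");
        }
    }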

Re: Streaming join performance

2023-08-08 Thread liu ron
Hi, David Regarding the N-way join, this feature aims to address the issue of state simplification; it is on the roadmap. Technically there are no limitations, but we'll need some time to find a sensible solution. Best, Ron David Anderson wrote on Wed, Aug 9, 2023 at 10:38: > This join optimization sounds p

Re: Streaming join performance

2023-08-08 Thread David Anderson
This join optimization sounds promising, but I'm wondering why Flink SQL isn't taking advantage of the N-Ary Stream Operator introduced in FLIP-92 [1][2] to implement an n-way join in a single operator. Is there something that makes this impossible/impractical? [1] https://cwiki.apache.org/confluen

RE: Streaming join performance

2023-08-05 Thread shuai xu
Hi, we are also paying attention to this issue and have completed the validation of the minibatch join optimization, including the intermediate message folding you mentioned. We plan to officially release it in Flink 1.19. This optimization could significantly improve the performance of join op

Re: streaming mode with both finite and infinite input sources

2022-02-25 Thread Dawid Wysakowicz
This should be supported in 1.14 if you enable checkpointing with finished tasks[1], which has been added in 1.14. In 1.15 it will be enabled by default. Best, Dawid [1] https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/fault-tolerance/checkpointing/#enabling-and-confi
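
A minimal sketch of turning that option on programmatically, assuming Flink 1.14 and the config key documented in [1] (execution.checkpointing.checkpoints-after-tasks-finish.enabled):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FinishedTasksCheckpointing {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Keep checkpointing alive after the finite (bounded) sources have finished.
            conf.setBoolean("execution.checkpointing.checkpoints-after-tasks-finish.enabled", true);

            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);
            env.enableCheckpointing(60_000); // checkpoint every minute

            env.fromElements(1, 2, 3).print(); // stand-in for the real finite + infinite pipeline
            env.execute("finished-tasks-checkpointing-sketch");
        }
    }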

Re: streaming mode with both finite and infinite input sources

2022-02-25 Thread Yuval Itzchakov
One possible option is to look into the hybrid source released in Flink 1.14 to support your use-case: https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/hybridsource/ On Fri, Feb 25, 2022, 09:19 Jin Yi wrote: > so we have a streaming job where the main work t
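
A minimal sketch of the hybrid source pattern referenced above, assuming the Flink 1.14 FileSource/KafkaSource APIs; the path, broker, and topic are placeholders:

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.connector.base.source.hybrid.HybridSource;
    import org.apache.flink.connector.file.src.FileSource;
    import org.apache.flink.connector.file.src.reader.TextLineFormat;
    import org.apache.flink.connector.kafka.source.KafkaSource;
    import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
    import org.apache.flink.core.fs.Path;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class HybridSourceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Finite part: read the historical files first.
            FileSource<String> history = FileSource
                .forRecordStreamFormat(new TextLineFormat(), new Path("file:///tmp/history")) // placeholder
                .build();

            // Infinite part: switch to Kafka once the files are exhausted.
            KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")          // placeholder
                .setTopics("events")                         // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

            HybridSource<String> source = HybridSource.builder(history).addSource(live).build();

            env.fromSource(source, WatermarkStrategy.noWatermarks(), "hybrid-source").print();
            env.execute("hybrid-source-sketch");
        }
    }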

RE: Streaming SQL support for redis streaming connector

2021-09-16 Thread Osada Paranaliyanage
Thanks, will have a look through! -Original Message- From: Yangze Guo Sent: Wednesday, September 15, 2021 11:25 AM To: Osada Paranaliyanage Cc: David Morávek ; user@flink.apache.org Subject: Re: Streaming SQL support for redis streaming connector [EXTERNAL EMAIL] This email has been

RE: Streaming SQL support for redis streaming connector

2021-09-16 Thread Osada Paranaliyanage
Hi Leonard, That’s awesome news. We are actually using documentdb. Any idea how much work it will be to make it work with documentdb instead? Thanks, Osada. From: Leonard Xu Sent: Wednesday, September 15, 2021 1:08 PM To: Osada Paranaliyanage Cc: user@flink.apache.org Subject: Re

Re: Streaming SQL support for redis streaming connector

2021-09-15 Thread Leonard Xu
Hi, Osada Just want to offer some material here. The flink-cdc-connectors project [1] may also help you; we recently added support for the document DB MongoDB [2]. Best, Leonard [1] https://github.com/ververica/flink-cdc-connectors [2] https://verve

Re: Streaming SQL support for redis streaming connector

2021-09-14 Thread Yangze Guo
use > as redis) > > > > Thanks, > > Osada. > > > > From: David Morávek > Sent: Tuesday, September 14, 2021 7:53 PM > To: Osada Paranaliyanage > Cc: user@flink.apache.org > Subject: Re: Streaming SQL support for redis streaming connector > > > > [EXTER

RE: Streaming SQL support for redis streaming connector

2021-09-14 Thread Osada Paranaliyanage
the WITH clause as redis) Thanks, Osada. From: David Morávek Sent: Tuesday, September 14, 2021 7:53 PM To: Osada Paranaliyanage Cc: user@flink.apache.org Subject: Re: Streaming SQL support for redis streaming connector [EXTERNAL EMAIL] This email has been received from an external source

Re: Streaming SQL support for redis streaming connector

2021-09-14 Thread David Morávek
Hi Osada, in theory building a Redis table from "CDC stream" should definitely be doable. Unfortunately Flink currently doesn't have any official Redis Sink for the Table API and there is currently no on-going effort for adding it, so it would need to be implemented first. The resulting syntax wou

Re: Streaming Patterns and Best Practices - featuring Apache Flink

2021-09-10 Thread Timo Walther
Thanks for sharing this with us Devin. If you haven't considered it already, maybe this could also be something for next Flink Forward? Regards, Timo On 02.09.21 21:02, Devin Bost wrote: I just released a new video that features Apache Flink in several design patterns: Streaming Patterns an

Re: streaming file sink OUT metrics not populating

2021-06-09 Thread Arvid Heise
For reference, the respective FLIP shows the ideas [1]. It's on our agenda for 1.14. [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-33%3A+Standardize+Connector+Metrics On Thu, Jun 3, 2021 at 6:41 PM Chesnay Schepler wrote: > This is a known issue, and cannot be fixed on the user sid

Re: streaming file sink OUT metrics not populating

2021-06-03 Thread Chesnay Schepler
This is a known issue, and cannot be fixed on the user side. The underlying problem is that this needs to be implemented separately for each source/sink and we haven't gotten around to doing that yet, but some progress is being made for 1.14 to that end. On 6/3/2021 6:06 PM, Vijayendra Yadav w

Re: Streaming File Sink cannot generate _SUCCESS tag files

2020-10-19 Thread highfei2011
Hi, Jingsong Lee Thanks for taking the time to respond to the email; I will try following your suggestion. Best, Yang On Oct 19, 2020 at 11:56, Jingsong Li wrote: Hi, Yang, "SUCCESSFUL_JOB_OUTPUT_DIR_MARKER" does not work in StreamingFileSink. You can take a look at the partition commit fea

Re: Streaming File Sink cannot generate _SUCCESS tag files

2020-10-18 Thread Jingsong Li
Hi, Yang, "SUCCESSFUL_JOB_OUTPUT_DIR_MARKER" does not work in StreamingFileSink. You can take a look to partition commit feature [1], [1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/table/connectors/filesystem.html#partition-commit Best, Jingsong Lee On Thu, Oct 15, 2020 a

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-12 Thread Timo Walther
Hi Austin, your explanation for the KeyedProcessFunction implementation sounds good to me. Using the time and state primitives for this task will make the implementation more explicit but also more readable. Let me know if you could solve your use case. Regards, Timo On 09.10.20 17:27, Aus

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-09 Thread Austin Cawley-Edwards
Hey Timo, Hah, that's a fair point about using time. I guess I should update my statement to "as a user, I don't want to worry about *manually managing* time". That's a nice suggestion with the KeyedProcessFunction and no windows, I'll give that a shot. If I don't want to emit any duplicates, I'd

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-09 Thread Timo Walther
Hi Austin, if you don't want to worry about time at all, you should probably not use any windows because those are a time-based operation. A solution that would look a bit nicer could be to use a pure KeyedProcessFunction and implement the deduplication logic without reusing windows. In Proc
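
A minimal sketch of the window-free deduplication Timo describes, assuming event-time timestamps are assigned upstream; the one-hour state retention is an arbitrary placeholder:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.util.Collector;

    // Emits each key only once; a timer later clears the flag so state does not grow forever.
    public class DedupFunction extends KeyedProcessFunction<String, String, String> {

        private transient ValueState<Boolean> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getState(
                new ValueStateDescriptor<>("seen", Boolean.class));
        }

        @Override
        public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
            if (seen.value() == null) {
                seen.update(true);
                out.collect(value);
                // Assumes the record carries an event-time timestamp; keep the flag for
                // one hour of event time, then clean it up.
                ctx.timerService().registerEventTimeTimer(ctx.timestamp() + 3_600_000L);
            }
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            seen.clear();
        }
    }

It would be wired in as stream.keyBy(...).process(new DedupFunction()).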

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-08 Thread Austin Cawley-Edwards
Hey Timo, Sorry for the delayed reply. I'm using the Blink planner and using non-time-based joins. I've got an example repo here that shows my query/ setup [1]. It's got the manual timestamp assignment commented out for now, but that does indeed solve the issue. I'd really like to not worry about

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-05 Thread Timo Walther
Btw which planner are you using? Regards, Timo On 05.10.20 10:23, Timo Walther wrote: Hi Austin, could you share some details of your SQL query with us? The reason why I'm asking is because I guess that the rowtime field is not inserted into the `StreamRecord` of DataStream API. The rowtime

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-05 Thread Timo Walther
Hi Austin, could you share some details of your SQL query with us? The reason why I'm asking is because I guess that the rowtime field is not inserted into the `StreamRecord` of DataStream API. The rowtime field is only inserted if there is a single field in the output of the query that is a

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-05 Thread Till Rohrmann
Hi Austin, thanks for offering to help. First I would suggest asking Timo whether this is an aspect which is still missing or whether we overlooked it. Based on that we can then take the next steps. Cheers, Till On Fri, Oct 2, 2020 at 7:05 PM Austin Cawley-Edwards < austin.caw...@gmail.com> wrot

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-02 Thread Austin Cawley-Edwards
Hey Till, Thanks for the notes. Yeah, the docs don't mention anything specific to this case, not sure if it's an uncommon one. Assigning timestamps on conversion does solve the issue. I'm happy to take a stab at implementing the feature if it is indeed missing and you all think it'd be worthwhile.

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-02 Thread Till Rohrmann
Hi Austin, yes, it should also work for ingestion time. I am not entirely sure whether event time is preserved when converting a Table into a retract stream. It should be possible and if it is not working, then I guess it is a missing feature. But I am sure that @Timo Walther knows more about it

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-01 Thread Austin Cawley-Edwards
Hey Till, Just a quick question on time characteristics -- this should work for IngestionTime as well, correct? Is there anything special I need to do to have the CsvTableSource/ toRetractStream call to carry through the assigned timestamps, or do I have to re-assign timestamps during the conversi

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-01 Thread Austin Cawley-Edwards
Perfect, thanks so much Till! On Thu, Oct 1, 2020 at 5:13 AM Till Rohrmann wrote: > Hi Austin, > > I believe that the problem is the processing time window. Unlike for event > time where we send a MAX_WATERMARK at the end of the stream to trigger all > remaining windows, this does not happen for

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-10-01 Thread Till Rohrmann
Hi Austin, I believe that the problem is the processing time window. Unlike for event time where we send a MAX_WATERMARK at the end of the stream to trigger all remaining windows, this does not happen for processing time windows. Hence, if your stream ends and you still have an open processing tim
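
A minimal sketch of the event-time alternative implied here, assuming the Flink 1.11+ WatermarkStrategy API; on a bounded input the final MAX_WATERMARK then fires the remaining windows:

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class EventTimeWindowSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Bounded input of (key, timestampMillis) pairs. With event-time windows, the
            // MAX_WATERMARK emitted at end of input fires the open windows, unlike
            // processing-time windows.
            env.fromElements(Tuple2.of("a", 1_000L), Tuple2.of("a", 61_000L), Tuple2.of("b", 2_000L))
                .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((t, ts) -> t.f1))
                .keyBy(t -> t.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(1)))
                .reduce((t1, t2) -> t2) // keep the latest element per key and window
                .print();

            env.execute("event-time-window-sketch");
        }
    }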

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-09-30 Thread Austin Cawley-Edwards
Hey all, Thanks for your patience. I've got a small repo that reproduces the issue here: https://github.com/austince/flink-1.10-sql-windowing-error Not sure what I'm doing wrong but it feels silly. Thanks so much! Austin On Tue, Sep 29, 2020 at 3:48 PM Austin Cawley-Edwards < austin.caw...@gmai

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-09-29 Thread Austin Cawley-Edwards
Hey Till, Thanks for the reply -- I'll try to see if I can reproduce this in a small repo and share it with you. Best, Austin On Tue, Sep 29, 2020 at 3:58 AM Till Rohrmann wrote: > Hi Austin, > > could you share with us the exact job you are running (including the > custom window trigger)? Thi

Re: Streaming SQL Job Switches to FINISHED before all records processed

2020-09-29 Thread Till Rohrmann
Hi Austin, could you share with us the exact job you are running (including the custom window trigger)? This would help us to better understand your problem. I am also pulling in Klou and Timo who might help with the windowing logic and the Table to DataStream conversion. Cheers, Till On Mon, S

Re: Streaming data to parquet

2020-09-14 Thread Senthil Kumar
Arvid, Jan and Ayush, Thanks for the ideas! -Kumar From: Jan Lukavský Date: Monday, September 14, 2020 at 6:23 AM To: "user@flink.apache.org" Subject: Re: Streaming data to parquet Hi, I'd like to mention another approach, which might not be as "flinkish", but rem

Re: Streaming data to parquet

2020-09-14 Thread Jan Lukavský
pache.org> Cc: Marek Maj <marekm...@gmail.com>, user <user@flink.apache.org> Subject: Re: Streaming data to parquet Hi, I'd like to mention another approach, which might not be as "flinkish", but rem

Re: Streaming data to parquet

2020-09-14 Thread Arvid Heise
; > From: Ayush Verma > Date: Friday, September 11, 2020 at 8:14 AM > To: Robert Metzger > Cc: Marek Maj , user > Subject: Re: Streaming data to parquet > Hi, > Looking at the problem broadly, file size is directly tied up with

Re: Streaming data to parquet

2020-09-11 Thread Senthil Kumar
appreciate any ideas etc. Cheers Kumar From: Ayush Verma Date: Friday, September 11, 2020 at 8:14 AM To: Robert Metzger Cc: Marek Maj , user Subject: Re: Streaming data to parquet Hi, Looking at the problem broadly, file size is directly tied up with how often you commit. No matter which

Re: Streaming data to parquet

2020-09-11 Thread Ayush Verma
Hi, Looking at the problem broadly, file size is directly tied up with how often you commit. No matter which system you use, this variable will always be there. If you commit frequently, you will be close to realtime, but you will have numerous small files. If you commit after long intervals, you

Re: Streaming data to parquet

2020-09-11 Thread Robert Metzger
Hi Marek, what you are describing is a known problem in Flink. There are some thoughts on how to address this in https://issues.apache.org/jira/browse/FLINK-11499 and https://issues.apache.org/jira/browse/FLINK-17505 Maybe some ideas there help you already for your current problem (use long checkp

Re: streaming restored state after restart

2020-06-13 Thread Till Rohrmann
Hi Adam, you could have a local field in your operator which you initialize with true on creation. Then, in the map or process function, you could check this field and, if true, output all buffered state. Finally, you only need to set this field to false so that any subsequent call to the map/process f
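
A minimal sketch of the flag-based approach described here, using operator list state as a stand-in for whatever the operator actually buffers; the flag resets whenever the operator is (re)started:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.util.Collector;

    // Buffers records in operator state and re-emits everything buffered so far the
    // first time the function is called after a (re)start.
    public class FlushOnRestartFunction extends RichFlatMapFunction<String, String>
            implements CheckpointedFunction {

        private transient ListState<String> buffer;
        private transient boolean firstInvocation;

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            buffer = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffer", String.class));
            firstInvocation = true; // plain field: re-initialized on every (re)start
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) {
            // nothing to do; 'buffer' is already registered operator state
        }

        @Override
        public void flatMap(String value, Collector<String> out) throws Exception {
            if (firstInvocation) {
                for (String restored : buffer.get()) {
                    out.collect(restored); // stream out the restored state once
                }
                firstInvocation = false;
            }
            buffer.add(value);
            out.collect(value);
        }
    }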

Re: Streaming multiple csv files

2020-05-29 Thread Robert Metzger
Hi Nikola, you could implement a custom SourceFunction that implements this in some way: If the files are small (< 10 MB) send each file as a record, then process it in a subsequent flatMap operation. If the files are large, split the work across the parallel sources and read them serially in the
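
A minimal sketch of the "one record per small file" variant of such a custom SourceFunction; the directory is a placeholder and there is no checkpointing of read positions:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;

    // Emits the full content of each small CSV file as one record; a downstream
    // flatMap can then split it into lines/rows.
    public class WholeFileSource implements SourceFunction<String> {

        private final String directory; // placeholder: directory containing the CSV files
        private volatile boolean running = true;

        public WholeFileSource(String directory) {
            this.directory = directory;
        }

        @Override
        public void run(SourceContext<String> ctx) throws Exception {
            try (Stream<Path> files = Files.list(Paths.get(directory))) {
                for (Path file : (Iterable<Path>) files::iterator) {
                    if (!running) {
                        break;
                    }
                    synchronized (ctx.getCheckpointLock()) {
                        ctx.collect(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
                    }
                }
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }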

Re: Streaming Job eventually begins failing during checkpointing

2020-04-27 Thread Yu Li
would it be possible to >>>> create so many operator states? Did you configure some parameters wrongly? >>>> >>>> >>>> [1] >>>> https://github.com/apache/beam/blob/4fc924a8193bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/ap

Re: Streaming Job eventually begins failing during checkpointing

2020-04-25 Thread Eleanore Jin
b9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/BufferingDoFnRunner.java#L95 >>> >>> Best >>> Yun Tang >>> -- >>> *From:* Stephen Patel >&g

Re: Streaming Job eventually begins failing during checkpointing

2020-04-23 Thread Stephan Ewen
5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/BufferingDoFnRunner.java#L95 >> >> Best >> Yun Tang >> ------ >> *From:* Stephen Patel >> *Sent:* Thursday, April 16, 2020 22:30 >&g

Re: Streaming Job eventually begins failing during checkpointing

2020-04-16 Thread Yun Tang
: Re: Streaming Job eventually begins failing during checkpointing Correction. I've actually found a place where it potentially might be creating a new operator state per checkpoint: https://github.com/apache/beam/blob/4fc924a8193bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/jav

Re: Streaming Job eventually begins failing during checkpointing

2020-04-16 Thread Stephen Patel
bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/BufferingDoFnRunner.java#L95 > > Best > Yun Tang > -- > *From:* Stephen Patel > *Sent:* Thursday, April 16, 2020 22:30

Re: Streaming Job eventually begins failing during checkpointing

2020-04-16 Thread Stephen Patel
Correction. I've actually found a place where it potentially might be creating a new operator state per checkpoint: https://github.com/apache/beam/blob/4fc924a8193bb9495c6b7ba755ced576bb8a35d5/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/stableinput/Buff

Re: Streaming Job eventually begins failing during checkpointing

2020-04-16 Thread Stephen Patel
I can't say that I ever call that directly. The beam library that I'm using does call it in a couple places: https://github.com/apache/beam/blob/v2.14.0/runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/io/UnboundedSourceWrapper.java#L422-L429 But it seems t

Re: Streaming Job eventually begins failing during checkpointing

2020-04-15 Thread Yun Tang
Hi Stephen This is not related to RocksDB but to the default on-heap operator state backend. From your exception stack trace, you have created too many operator states (more than 32767). How do you call context.getOperatorStateStore().getListState or context.getOperatorStateStore().getBroadcas
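
For contrast, a minimal sketch of the intended usage: the operator state is registered once, under a stable descriptor name, in initializeState() rather than anew for every checkpoint:

    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.runtime.state.FunctionInitializationContext;
    import org.apache.flink.runtime.state.FunctionSnapshotContext;
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
    import org.apache.flink.streaming.api.functions.sink.SinkFunction;

    // Operator state should be registered once, under a stable name, in initializeState();
    // registering a new descriptor name per checkpoint grows the state metadata until the
    // Short.MAX_VALUE limit mentioned in this thread is hit.
    public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

        private transient ListState<String> buffered;

        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            buffered = context.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-records", String.class)); // fixed name
        }

        @Override
        public void snapshotState(FunctionSnapshotContext context) {
            // nothing extra to do; 'buffered' already points at the registered state
        }

        @Override
        public void invoke(String value, Context context) throws Exception {
            buffered.add(value);
        }
    }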

Re: Streaming kafka data sink to hive

2020-03-19 Thread Jingsong Li
Hi wanglei, > 1 Is there any flink-hive-connector that I can use to write to Hive in streaming mode? "Streaming kafka data sink to hive" is under discussion [1] and POC work is ongoing [2]. We want to support it in release-1.11. > 2 Since HDFS is not friendly to frequent appends and Hive's data is s

Re: Streaming Files to S3

2019-11-28 Thread Arvid Heise
Hi Li, S3 file sink will write data into prefixes, with as many part-files as the degree of parallelism. This structure comes from the good ol' Hadoop days, where an output folder also contained part-files and is independent of S3. However, each of the part-files will be uploaded in a multipart fa

Re: Streaming data to Segment

2019-11-21 Thread Li Peng
Awesome, I'll definitely try that out, thanks! On Wed, Nov 20, 2019 at 9:36 PM Yuval Itzchakov wrote: > Hi Li, > > You're in the right direction. One additional step would be to use > RichSinkFunction[Data] instead of SinkFunction[Data] which exposes open and > close functions which allow you to

Re: Streaming data to Segment

2019-11-20 Thread Yuval Itzchakov
Hi Li, You're in the right direction. One additional step would be to use RichSinkFunction[Data] instead of SinkFunction[Data], which exposes open and close functions that allow you to initialize and dispose of resources properly. On Thu, 21 Nov 2019, 5:23 Li Peng, wrote: > Hey folks, I'm interest
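
A minimal sketch of that RichSinkFunction shape; SegmentClient here is a stand-in stub, not the real Segment SDK:

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

    // open()/close() manage the client lifecycle once per parallel sink instance.
    public class SegmentSink extends RichSinkFunction<String> {

        // Stand-in for the real Segment/analytics client.
        static class SegmentClient {
            void track(String event) { /* send the event */ }
            void shutdown() { /* release connections */ }
        }

        private transient SegmentClient client;

        @Override
        public void open(Configuration parameters) {
            client = new SegmentClient(); // initialize resources once per subtask
        }

        @Override
        public void invoke(String value, Context context) {
            client.track(value);
        }

        @Override
        public void close() {
            if (client != null) {
                client.shutdown();
            }
        }
    }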

Re: Streaming File Sink - Parquet File Writer

2019-10-30 Thread Kostas Kloudas
Hi Vinay, You are correct when saying that the bulk formats only support onCheckpointRollingPolicy. The reason for this has to do with the fact that currently Flink relies on the Hadoop writer for Parquet. Bulk formats keep important details about how they write the actual data (such as compress
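
A minimal sketch of a bulk Parquet StreamingFileSink, where part files can only be finalized on checkpoints; the POJO and output path are placeholders:

    import org.apache.flink.core.fs.Path;
    import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

    public class ParquetSinkSketch {

        // Simple POJO written via Avro reflection.
        public static class Event {
            public String id;
            public long ts;
            public Event() {}
            public Event(String id, long ts) { this.id = id; this.ts = ts; }
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            // Bulk formats only roll on checkpoints, so part files are finalized when a
            // checkpoint completes -- checkpointing must be enabled.
            env.enableCheckpointing(60_000);

            StreamingFileSink<Event> sink = StreamingFileSink
                .forBulkFormat(new Path("file:///tmp/parquet-out"), // placeholder output path
                    ParquetAvroWriters.forReflectRecord(Event.class))
                .build();

            env.fromElements(new Event("a", 1L), new Event("b", 2L)).addSink(sink);
            env.execute("parquet-sink-sketch");
        }
    }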

Re: Streaming write to Hive

2019-09-05 Thread Qi Luo
Hi JingsongLee, Fantastic! We'll look into it. Thanks, Qi On Fri, Sep 6, 2019 at 10:52 AM JingsongLee wrote: > Hi luoqi: > > With partition support[1], I want to introduce a FileFormatSink to > cover streaming exactly-once and partition-related logic for flink > file connectors and hive connec

Re: Streaming write to Hive

2019-09-05 Thread JingsongLee
Hi luoqi: With partition support[1], I want to introduce a FileFormatSink to cover streaming exactly-once and partition-related logic for flink file connectors and hive connector. You can take a look. [1] https://docs.google.com/document/d/15R3vZ1R_pAHcvJkRx_CWleXgl08WL3k_ZpnWSdzP7GY/edit?usp=

Re: Streaming write to Hive

2019-09-05 Thread Bowen Li
Hi, I'm not sure if there's one yet. Feel free to create one if not. On Wed, Sep 4, 2019 at 11:28 PM Qi Luo wrote: > Hi Bowen, > > Thank you for the information! Streaming write to Hive is a very common > use case for our users. Is there any open issue for this to which we can > try contributin

Re: Streaming write to Hive

2019-09-04 Thread Qi Luo
Hi Bowen, Thank you for the information! Streaming write to Hive is a very common use case for our users. Is there any open issue for this to which we can try contributing? +Yufei and Chang who are also interested in this. Thanks, Qi On Thu, Sep 5, 2019 at 12:16 PM Bowen Li wrote: > Hi Qi, >

Re: Streaming write to Hive

2019-09-04 Thread Bowen Li
Hi Qi, With 1.9 out of the box, I'm afraid not. You can make HiveTableSink implement AppendStreamTableSink (an empty interface for now) so it can be picked up in a streaming job. Also, streaming requires checkpointing, and the Hive sink doesn't do that yet. There might be other tweaks you need to make.

Re: Streaming from a file

2019-08-01 Thread Zhu Zhu
Hi Vishwas, Not sure whether I understand your needs correctly. I think currently readTextFile(path) does return a DataStream. From the code, it emits each line as soon as it is read from the file, i.e., in a line-by-line streaming pattern. Thanks, Zhu Zhu Vishwas Siravara wrote on Thu, Aug 1, 2019 at 11:50 PM:

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-27 Thread Rafi Aroch
Thanks Piotr & Kostas. Really looking forward to this :) Rafi On Wed, Mar 27, 2019 at 10:58 AM Piotr Nowojski wrote: > Hi Rafi, > > There is also an ongoing effort to support bounded streams in DataStream > API [1], which might provide the backbone for the functionalists that you > need. > >

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-27 Thread Piotr Nowojski
Hi Rafi, There is also an ongoing effort to support bounded streams in the DataStream API [1], which might provide the backbone for the functionality that you need. Piotrek [1] https://issues.apache.org/jira/browse/FLINK-11875 > On 25 Mar 2019

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-25 Thread Rafi Aroch
Hi Kostas, Thank you. I'm currently testing my job against a small file, so it finishes before the checkpointing starts. But also, if it were a larger file and a checkpoint did happen, there would always be trailing events, starting after the last checkpoint, until the source has finished. So woul

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Rafi Aroch
Hi Kostas, Yes I have. Rafi On Thu, Mar 21, 2019, 20:47 Kostas Kloudas wrote: > Hi Rafi, > > Have you enabled checkpointing for your job? > > Cheers, > Kostas > > On Thu, Mar 21, 2019 at 5:18 PM Rafi Aroch wrote: > >> Hi Piotr and Kostas, >> >> Thanks for your reply. >> >> The issue is that I

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Kostas Kloudas
Hi Rafi, Have you enabled checkpointing for your job? Cheers, Kostas On Thu, Mar 21, 2019 at 5:18 PM Rafi Aroch wrote: > Hi Piotr and Kostas, > > Thanks for your reply. > > The issue is that I don't see any committed files, only in-progress. > I tried to debug the code for more details. I see t

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Rafi Aroch
Hi Piotr and Kostas, Thanks for your reply. The issue is that I don't see any committed files, only in-progress. I tried to debug the code for more details. I see that in *BulkPartWriter* I do reach the *write* methods and see events getting written, but I never reach the *closeForCommit*. I reac

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Kostas Kloudas
Hi Rafi, Piotr is correct. In-progress files are not necessarily readable. The valid files are the ones that are "committed" or finalized. Cheers, Kostas On Thu, Mar 21, 2019 at 2:53 PM Piotr Nowojski wrote: > Hi, > > I’m not sure, but shouldn’t you be just reading committed files and ignore >

Re: Streaming Protobuf into Parquet file not working with StreamingFileSink

2019-03-21 Thread Piotr Nowojski
Hi, I’m not sure, but shouldn’t you be just reading committed files and ignore in-progress? Maybe Kostas could add more insight to this topic. Piotr Nowojski > On 20 Mar 2019, at 12:23, Rafi Aroch wrote: > > Hi, > > I'm trying to stream events in Prorobuf format into a parquet file. > I look

Re: Streaming Checkpoint - Could not materialize checkpoint Exception

2019-01-16 Thread sohimankotia
Hi Andrey, Yes. Setting setFailOnCheckpointingErrors(false) solved the problem. But in between I am getting this error: 2019-01-16 21:07:26,979 ERROR org.apache.flink.runtime.rest.handler.taskmanager.TaskManagerDetailsHandler - Implementation error: Unhandled exception. org.apache.flink.runti

Re: Streaming Checkpoint - Could not materialize checkpoint Exception

2019-01-16 Thread Andrey Zagrebin
Hi Sohi, Could it be that you configured your job tasks to fail if checkpoint fails (streamExecutionEnvironment.getCheckpointConfig().setFailOnCheckpointingErrors(true))? Could you send the complete job master log? If checkpoint 470 has been subsumed by 471, it could be that its directory is remo
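
A minimal sketch of the setting being discussed, which tolerates checkpoint failures instead of failing the task; the pipeline itself is a placeholder:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointFailureToleranceSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(60_000);

            // Do not fail the tasks when a single checkpoint cannot be materialized;
            // the job keeps running and simply tries again at the next checkpoint.
            env.getCheckpointConfig().setFailOnCheckpointingErrors(false);

            env.fromElements(1, 2, 3).print(); // placeholder pipeline
            env.execute("checkpoint-failure-tolerance-sketch");
        }
    }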

Re: Streaming Checkpoint - Could not materialize checkpoint Exception

2019-01-16 Thread Till Rohrmann
Hi Sohimankotia, you can control Flink's failure behaviour in case of a checkpoint failure via the `ExecutionConfig#setFailTaskOnCheckpointError(boolean)`. Per default it is set to true which means that a Flink task will fail if a checkpoint error occurs. If you set it to false, then the job won't

Re: Streaming Checkpoint - Could not materialize checkpoint Exception

2019-01-15 Thread Congxian Qiu
Hi, Sohi You can check out doc[1][2] to find out the answer. [1] https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/stream/state/checkpointing.html#enabling-and-configuring-checkpointing [2] https://ci.apache.org/projects/flink/flink-docs-release-1.7/dev/restart_strategies.html sohim

Re: Streaming Checkpoint - Could not materialize checkpoint Exception

2019-01-15 Thread sohimankotia
Yes. File got deleted. 2019-01-15 10:40:41,360 INFO FSNamesystem.audit: allowed=true ugi=hdfs (auth:SIMPLE) ip=/192.168.3.184 cmd=delete src=/pipeline/job/checkpoints/e9a08c0661a6c31b5af540cf352e1265/chk-470/5fb3a899-8c0f-45f6-a847-42cbb71e6d19 dst=null perm=null

Re: Streaming Checkpoint - Could not materialize checkpoint Exception

2019-01-15 Thread Congxian Qiu
Hi, Sohi Seems like the checkpoint file `hdfs:/pipeline/job/checkpoints/e9a08c0661a6c31b5af540cf352e1265/chk-470/5fb3a899-8c0f-45f6-a847-42cbb71e6d19` did not exist for some reason, you can check the life cycle of this file from hdfs audit log and find out why the file did not exist. maybe the chec

Re: Streaming to Parquet Files in HDFS

2018-10-11 Thread Averell
Hi Kostas, Thanks for the info. That error was caused by building your code against an out-of-date baseline. I rebased my branch and rebuilt it, and there's no more such issue. I've been testing, and so far have some questions/issues, as below: 1. I'm not able to write to S3 with the following URI format:

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi, Yes, please enable DEBUG for streaming to see all the logs, including those from the StreamTask. A checkpoint is “valid” as soon as it gets acknowledged. As the documentation says, the job will restart from “the last **successful** checkpoint”, which is the most recent acknowledged one. Cheers, Kostas

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Averell
Hi Kostas, Yes, I set the level to DEBUG, but only for org.apache.flink.streaming.api.functions.sink.filesystem.bucket. I will try to enable it for org.apache.flink.streaming. One (possible) issue I just found with my build is that I had not used the latest master branch when merging with your

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi, I just saw that you have already set the level to DEBUG. These are all your DEBUG logs of the TM when running on YARN? Also did you try to wait a bit more to see if the acknowledgements of the checkpoints arrive a bit later? Checkpoints and acknowledgments are not necessarily aligned. Kost

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi Averell, Could you set your logging to DEBUG? This may shed some light on what is happening as it will contain more logs. Kostas > On Oct 7, 2018, at 11:03 AM, Averell wrote: > > Hi Kostas, > > I'm using a build with your PR. However, it seemed the issue is not with S3, > as when I tried t

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Averell
Hi Kostas, I'm using a build with your PR. However, it seems the issue is not with S3, as when I tried to write to the local file system (file:///, not HDFS), I also got the same problem - only the first part was published. All remaining parts were in-progress and had names prefixed with ".". From Fli

Re: Streaming to Parquet Files in HDFS

2018-10-07 Thread Kostas Kloudas
Hi Averell, From the logs, only checkpoint 2 was acknowledged (search for “eceived completion notification for checkpoint with id=“) and this is why no more files are finalized. So only checkpoint 2 was successfully completed. BTW, are you using the PR you mentioned before or Flink 1.6? I am as

Re: Streaming to Parquet Files in HDFS

2018-10-06 Thread Averell
Hi Kostas, Please ignore my previous email about the issue with security. It seems I had mixed versions of shaded and unshaded jars. However, I'm now facing another issue with writing parquet files: only the first part is closed. All the subsequent parts are kept in in-progress state forev

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Averell
Hi Kostas, I tried your PR - trying to write to S3 from Flink running on AWS, and I got the following error. I copied the three jar files flink-hadoop-fs-1.7-SNAPSHOT.jar, flink-s3-fs-base-1.7-SNAPSHOT.jar, flink-s3-fs-hadoop-1.7-SNAPSHOT.jar to lib/ directory. Do I need to make any change to HADO

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Averell
What a great news. Thanks for that, Kostas. Regards, Averell -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Kostas Kloudas
Hi Averell, There is no such “out-of-the-box” solution, but there is an open PR for adding S3 support to the StreamingFileSink [1]. Cheers, Kostas [1] https://github.com/apache/flink/pull/6795 > On Oct 5, 2018, at 11:14 AM, Averell wrote: > > Hi K

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Averell
Hi Kostas, Thanks for the info. Just one more question regarding writing parquet. I need to write my stream as parquet to S3. As per this ticket https://issues.apache.org/jira/browse/FLINK-9752, it is currently not supported. Is there any ready-to-us

Re: Streaming to Parquet Files in HDFS

2018-10-05 Thread Kostas Kloudas
Hi Averell, You are right that for Bulk Formats like Parquet, we roll on every checkpoint. This is currently a limitation that has to do with the fact that bulk formats gather and rely on metadata that they keep internally and which we cannot checkpoint in Flink, as they do not expose them. Set

Re: Streaming to Parquet Files in HDFS

2018-10-04 Thread Averell
Hi Fabian, Kostas, From the description of this ticket https://issues.apache.org/jira/browse/FLINK-9753, I understand that my output parquet file with StreamingFileSink will now span multiple checkpoints. However, when I tried (as in the code snippet below) I still see that one "part-X-X" fi

Re: Streaming to Parquet Files in HDFS

2018-10-01 Thread Biswajit Das
Nice to see this finally! On Mon, Oct 1, 2018 at 1:53 AM Fabian Hueske wrote: > Hi Bill, > > Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the > previously mentioned StreamingFileSink [1], [2]. > > Best, Fabian > > [1] https://issues.apache.org/jira/browse/FLINK-9753 > [

Re: Streaming to Parquet Files in HDFS

2018-10-01 Thread Fabian Hueske
Hi Bill, Flink 1.6.0 supports writing Avro records as Parquet files to HDFS via the previously mentioned StreamingFileSink [1], [2]. Best, Fabian [1] https://issues.apache.org/jira/browse/FLINK-9753 [2] https://issues.apache.org/jira/browse/FLINK-9750 Written on Fri., Sep. 28, 2018 at 23:36 by

Re: Streaming to Parquet Files in HDFS

2018-09-28 Thread hao gao
Hi Bill, I wrote those two Medium posts you mentioned above, but clearly the techlab one is much better. I would suggest just "closing the file when checkpointing", which is the easiest way. If you use BucketingSink, you can modify the code to make it work. Just replace the code from line 691 to 693

Re: streaming predictions

2018-07-24 Thread Andrea Spina
Dear Cederic, I did something similar to yours a while ago in this work [1], but I've always been working within the batch context. I'm also a co-author of flink-jpmml and, since a flink2pmml model saver library doesn't currently exist, I'd suggest a twofold strategy to tackle this problem:

Re: streaming predictions

2018-07-24 Thread David Anderson
One option (which I haven't tried myself) would be to somehow get the model into PMML format, and then use https://github.com/FlinkML/flink-jpmml to score the model. You could either use another machine learning framework to train the model (i.e., a framework that directly supports PMML export), or

Re: streaming predictions

2018-07-22 Thread Xingcan Cui
Hi Cederic, If the model is a simple function, you can just load it and make predictions using the map/flatMap function in the StreamEnvironment. But I’m afraid the model trained by Flink-ML should be a “batch job", whose predict method takes a Dataset as the parameter and outputs another Datas

Re: streaming predictions

2018-07-22 Thread Hequn Cheng
Hi Cederic, I am not familiar with SVM or machine learning, but I think we can work it out together. What problems have you met when trying to implement this function? From my point of view, we can rebuild the model in the flatMap function and use it to predict the input data. There are some flatMa
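
A minimal sketch of that idea: load or rebuild the model once in open() and score records in flatMap(); the weight-vector model here is a stand-in, not Flink-ML's SVM output format:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Rebuilds/loads the trained model once per parallel instance in open(), then
    // scores each incoming feature vector in flatMap().
    public class PredictFunction extends RichFlatMapFunction<double[], Double> {

        private transient double[] weights;

        @Override
        public void open(Configuration parameters) {
            // Placeholder: in practice, load the exported weights from a file or broadcast them.
            weights = new double[] {0.4, -1.2, 0.7};
        }

        @Override
        public void flatMap(double[] features, Collector<Double> out) {
            double score = 0.0;
            for (int i = 0; i < weights.length && i < features.length; i++) {
                score += weights[i] * features[i];
            }
            out.collect(score >= 0 ? 1.0 : -1.0); // SVM-style sign decision
        }
    }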

Re: Streaming

2018-06-28 Thread Hequn Cheng
Hi aitozi, 1> How will SQL be translated into a datastream job? The Table API and SQL leverage Apache Calcite for parsing, validation, and query optimization. After optimization, the logical plan of the job will be translated into a datastream job. The logical plan contains many different logical ope

Re: Streaming

2018-06-27 Thread aitozi
Hi, all Thanks for your reply. 1. Can I ask how SQL like the example below is transformed into a low-level datastream job? 2. If I implement a distinct in a datastream job, and there is no keyBy needed in advance, and we just calculate the global distinct count, can I just use the AllWindowedStream or

Re: Streaming

2018-06-27 Thread zhangminglei
Hi, Sihua & Aitozi I would like to add more here. As @Sihua said, we need to query the state frequently. If you use Redis to store these states, it will consume a lot of your Redis resources, so you can use a Bloom filter before accessing Redis. If a pv is reported to exist by the Bloom filter, th
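
A minimal sketch of the Bloom-filter-in-front-of-Redis idea, using Guava's BloomFilter; the Redis lookup is a placeholder method, not a real client call:

    import java.nio.charset.StandardCharsets;
    import com.google.common.hash.BloomFilter;
    import com.google.common.hash.Funnels;

    // Checks a local Bloom filter first and only queries Redis (stand-in 'redisGet')
    // when the filter says the key may exist, saving lookups for first-time keys.
    public class PvDedupCheck {

        private final BloomFilter<String> filter =
            BloomFilter.create(Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.01);

        public boolean isDuplicate(String pvKey) {
            if (!filter.mightContain(pvKey)) {
                filter.put(pvKey);   // definitely new: remember it, no Redis round trip needed
                return false;
            }
            // Possibly seen before (Bloom filters can report false positives): confirm in Redis.
            boolean existsInRedis = redisGet(pvKey);
            if (!existsInRedis) {
                filter.put(pvKey);
            }
            return existsInRedis;
        }

        // Placeholder for the real Redis client call (e.g. a GET/EXISTS lookup).
        private boolean redisGet(String pvKey) {
            return false;
        }
    }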

Re: Streaming

2018-06-27 Thread Hequn Cheng
For the above two non-window approaches, the second one achieves a better performance. => For the above two non-window approaches, the second one achieves a better performance in most cases especially when there are many same rows. On Thu, Jun 28, 2018 at 12:25 AM, Hequn Cheng wrote: > Hi aitoz

Re: Streaming

2018-06-27 Thread Hequn Cheng
Hi aitozi, 1> CountDistinct Currently (flink-1.5), CountDistinct is supported in SQL only under windows, as Rong Rong described. There are ways to implement non-window CountDistinct, for example: a) you can write a CountDistinct UDAF using MapView, or b) use two groupBys to achieve it. The first groupB
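
A minimal sketch of approach b), the two-level GROUP BY, here fed by a datagen table so it runs standalone (assuming a Flink version with the datagen connector); column names are placeholders:

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    public class TwoLevelDistinctSketch {
        public static void main(String[] args) {
            TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.newInstance().inStreamingMode().build());

            // Stand-in source table with columns a and b.
            tEnv.executeSql(
                "CREATE TABLE t (a STRING, b STRING) " +
                "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

            // Inner GROUP BY deduplicates the (a, b) pairs; the outer GROUP BY then counts
            // the remaining rows per a, i.e. a non-window COUNT(DISTINCT b).
            tEnv.executeSql(
                "SELECT a, COUNT(*) AS distinct_b " +
                "FROM (SELECT a, b FROM t GROUP BY a, b) " +
                "GROUP BY a").print();
        }
    }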

Re: Streaming

2018-06-27 Thread Rong Rong
Hi, Stream distinct accumulators are actually supported in the SQL API [1]. The syntax is pretty much identical to the batch case. A simple example using a tumbling window would be: > SELECT COUNT(DISTINCT col) > FROM t > GROUP BY TUMBLE(rowtime, INTERVAL '1' MINUTE) I haven't added the support but
