Hi Bill,

I wrote those two Medium posts you mentioned above, but clearly the techlab one is much better.

I would suggest the simplest approach: just "close the file when checkpointing". If you use BucketingSink, you can modify its code to make this work. Just replace lines 691 to 693 with a call to closeCurrentPartFile():

https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691

This should guarantee exactly-once. You may be left with some underscore-prefixed files when a Flink job fails, but those files are usually ignored by query engines/readers, for example Presto.
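To make that change concrete: the lines in question live in BucketingSink#snapshotState, where the in-progress writer is flushed and its valid length recorded (so the file can be truncated on restore, which doesn't work for Parquet since the footer is only written on close). A rough sketch of the idea, paraphrased from memory of the 1.3.2 source rather than an exact diff:

```java
// Inside BucketingSink#snapshotState(FunctionSnapshotContext), per bucket:
for (Map.Entry<String, BucketState<T>> entry : state.bucketStates.entrySet()) {
    BucketState<T> bucketState = entry.getValue();

    if (bucketState.isWriterOpen) {
        // Original (~lines 691-693): keep the file open, just flush and
        // remember how many bytes are valid for truncation on recovery:
        //     bucketState.currentFileValidLength = bucketState.writer.flush();

        // Suggested replacement: close the current part file outright.
        // closeCurrentPartFile() moves the in-progress file to pending, and
        // pending files are finalized when the checkpoint completes, so a
        // proper Parquet footer is written and no truncation is ever needed.
        closeCurrentPartFile(bucketState);
    }
    // ... rest of snapshotState unchanged ...
}
```

The trade-off is one part file per bucket per checkpoint interval, so with frequent checkpoints you can end up with many small Parquet files; a longer interval (you said up to an hour is fine) mitigates that.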
If you use 1.6 or later, I think the issue is already addressed: https://issues.apache.org/jira/browse/FLINK-9750

Thanks,
Hao

On Fri, Sep 28, 2018 at 1:57 PM William Speirs <wspe...@apache.org> wrote:
> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
> files on HDFS via Flink. I'm able to read, parse, and construct objects for
> my messages in Flink; however, writing to Parquet is tripping me up. I do
> *not* need this to be real-time; a delay of a few minutes, even up to
> an hour, is fine.
>
> I've found the following articles talking about this being very difficult:
> * https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
> * https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
> * http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>
> All of these posts speak of trouble using the checkpointing mechanisms
> and Parquet's need to perform batch writes. I'm not experienced enough with
> Flink's checkpointing or Parquet's file format to completely understand
> the issue. So my questions are as follows:
>
> 1) Is this possible in Flink in an exactly-once way? If not, is it
> possible in a way that _might_ cause duplicates during an error?
>
> 2) Is there another/better format to use than Parquet that offers
> compression and the ability to be queried by something like Drill or Impala?
>
> 3) Any further recommendations for solving the overall problem: ingesting
> syslogs and writing them to file(s) searchable by an SQL(-like)
> framework?
>
> Thanks!
>
> Bill-

--
Thanks
- Hao
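Returning to the 1.6+ suggestion above: FLINK-9750 introduced StreamingFileSink, whose bulk-format mode rolls the part file on every checkpoint, which is exactly what Parquet needs. A minimal sketch; the `LogEvent` POJO and `buildSyslogStream` helper are made-up placeholders for however you parse your Kafka stream, and it assumes the flink-parquet and parquet-avro dependencies are on the classpath:

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Bulk formats only roll (and finalize) files on checkpoint, so
// checkpointing must be enabled; the interval controls file size/latency.
env.enableCheckpointing(60_000);

// Placeholder: your Kafka-source-plus-parsing pipeline producing LogEvent objects.
DataStream<LogEvent> parsed = buildSyslogStream(env);

StreamingFileSink<LogEvent> sink = StreamingFileSink
    .forBulkFormat(
        new Path("hdfs:///data/syslog-parquet"),
        // Derives an Avro schema from the POJO via reflection and writes Parquet.
        ParquetAvroWriters.forReflectRecord(LogEvent.class))
    .build();

parsed.addSink(sink);
```

In-progress files only become visible "finished" files when a checkpoint completes, which is what gives you the exactly-once behavior end to end, and the resulting Parquet files are directly queryable by Presto/Drill/Impala.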