I'm trying to stream log messages (syslog fed into Kafka) into Parquet files on HDFS via Flink. I'm able to read, parse, and construct objects for my messages in Flink; however, writing to Parquet is tripping me up. I do *not* need this to be real-time; a delay of a few minutes, even up to an hour, is fine.
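For reference, here's a minimal sketch of the kind of job I have in mind. This is not working code — `SyslogEvent`, its `parse` method, the topic name, and the broker address are all placeholders for my actual setup, and it assumes Flink's `StreamingFileSink` with the `flink-parquet` module on the classpath:

```java
// Sketch only: Kafka -> parse -> Parquet on HDFS.
// SyslogEvent, the topic, and the broker address are placeholders.
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class SyslogToParquet {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats like Parquet roll files on checkpoint,
        // so checkpointing has to be enabled for anything to be committed.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "syslog-to-parquet");

        DataStream<SyslogEvent> events = env
            .addSource(new FlinkKafkaConsumer<>("syslog", new SimpleStringSchema(), props))
            .map(SyslogEvent::parse); // my existing parsing logic

        StreamingFileSink<SyslogEvent> sink = StreamingFileSink
            .forBulkFormat(
                new Path("hdfs:///logs/parquet"),
                ParquetAvroWriters.forReflectRecord(SyslogEvent.class))
            .build();

        events.addSink(sink);
        env.execute("syslog-to-parquet");
    }
}
```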
I've found the following articles describing this as being very difficult:

* https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
* https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
* https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
* http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html

All of these posts describe trouble reconciling Flink's checkpointing mechanism with Parquet's need to perform batch writes. I'm not experienced enough with Flink's checkpointing or Parquet's file format to fully understand the issue. So my questions are as follows:

1) Is this possible in Flink with exactly-once semantics? If not, is it possible in a way that _might_ produce duplicates after a failure?
2) Is there another/better format than Parquet that offers compression and the ability to be queried by something like Drill or Impala?
3) Any further recommendations for solving the overall problem: ingesting syslogs and writing them to file(s) searchable by an SQL(-like) framework?

Thanks!

Bill