I'm trying to stream log messages (syslog fed into Kafka) into Parquet files on HDFS via Flink. I'm able to read, parse, and construct objects for my messages in Flink; however, writing to Parquet is tripping me up. I do *not* need this to be real-time; a delay of a few minutes, even up to an hour, is fine.
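For reference, here's a minimal sketch of the kind of job I have in mind. This is not working code — `SyslogEvent`, its `parse` method, the topic name, and the broker address are all placeholders for my actual setup, and it assumes Flink's `StreamingFileSink` with the `flink-parquet` module on the classpath:

```java
// Sketch only: Kafka -> parse -> Parquet on HDFS.
// SyslogEvent, the topic, and the broker address are placeholders.
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class SyslogToParquet {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Bulk formats like Parquet roll files on checkpoint,
        // so checkpointing has to be enabled for anything to be committed.
        env.enableCheckpointing(60_000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "syslog-to-parquet");

        DataStream<SyslogEvent> events = env
            .addSource(new FlinkKafkaConsumer<>("syslog", new SimpleStringSchema(), props))
            .map(SyslogEvent::parse); // my existing parsing logic

        StreamingFileSink<SyslogEvent> sink = StreamingFileSink
            .forBulkFormat(
                new Path("hdfs:///logs/parquet"),
                ParquetAvroWriters.forReflectRecord(SyslogEvent.class))
            .build();

        events.addSink(sink);
        env.execute("syslog-to-parquet");
    }
}
```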
I've found the following articles describing this as being very difficult:

* https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
* https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
* https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
* http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html

All of these posts describe trouble reconciling Flink's checkpointing mechanism with Parquet's need to perform batch writes. I'm not experienced enough with Flink's checkpointing or Parquet's file format to fully understand the issue. So my questions are as follows:

1) Is this possible in Flink with exactly-once semantics? If not, is it possible in a way that _might_ produce duplicates after a failure?
2) Is there another/better format than Parquet that offers compression and the ability to be queried by something like Drill or Impala?
3) Any further recommendations for solving the overall problem: ingesting syslogs and writing them to file(s) searchable by an SQL(-like) framework?

Thanks!

Bill