Hi Bill,

I wrote those two Medium posts you mentioned above, but clearly the techlab one is much better.

I would suggest the simplest approach: just "close the file when checkpointing". If you use BucketingSink, you can modify its code to make this work. Just replace lines 691 to 693 with a call to closeCurrentPartFile():

https://github.com/apache/flink/blob/release-1.3.2-rc1/flink-connectors/flink-connector-filesystem/src/main/java/org/apache/flink/streaming/connectors/fs/bucketing/BucketingSink.java#L691

This should guarantee exactly-once. You may be left with some underscore-prefixed files when a Flink job fails, but those files are usually ignored by query engines/readers, for example Presto.
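To make that change concrete: the lines in question live in BucketingSink#snapshotState, where the in-progress writer is flushed and its valid length recorded (so the file can be truncated on restore, which doesn't work for Parquet since the footer is only written on close). A rough sketch of the idea, paraphrased from memory of the 1.3.2 source rather than an exact diff:

```java
// Inside BucketingSink#snapshotState(FunctionSnapshotContext), per bucket:
for (Map.Entry<String, BucketState<T>> entry : state.bucketStates.entrySet()) {
    BucketState<T> bucketState = entry.getValue();

    if (bucketState.isWriterOpen) {
        // Original (~lines 691-693): keep the file open, just flush and
        // remember how many bytes are valid for truncation on recovery:
        //     bucketState.currentFileValidLength = bucketState.writer.flush();

        // Suggested replacement: close the current part file outright.
        // closeCurrentPartFile() moves the in-progress file to pending, and
        // pending files are finalized when the checkpoint completes, so a
        // proper Parquet footer is written and no truncation is ever needed.
        closeCurrentPartFile(bucketState);
    }
    // ... rest of snapshotState unchanged ...
}
```

The trade-off is one part file per bucket per checkpoint interval, so with frequent checkpoints you can end up with many small Parquet files; a longer interval (you said up to an hour is fine) mitigates that.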
If you use 1.6 or later, I think the issue is already addressed: https://issues.apache.org/jira/browse/FLINK-9750

Thanks,
Hao

On Fri, Sep 28, 2018 at 1:57 PM William Speirs <wspe...@apache.org> wrote:
> I'm trying to stream log messages (syslog fed into Kafka) into Parquet
> files on HDFS via Flink. I'm able to read, parse, and construct objects for
> my messages in Flink; however, writing to Parquet is tripping me up. I do
> *not* need this to be real-time; a delay of a few minutes, even up to
> an hour, is fine.
>
> I've found the following articles talking about this being very difficult:
> * https://medium.com/hadoop-noob/a-realtime-flink-parquet-data-warehouse-df8c3bd7401
> * https://medium.com/hadoop-noob/flink-parquet-writer-d127f745b519
> * https://techlab.bol.com/how-not-to-sink-a-data-stream-to-files-journeys-from-kafka-to-parquet/
> * http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Rolling-sink-parquet-Avro-output-td11123.html
>
> All of these posts speak of trouble using the checkpointing mechanisms
> and Parquet's need to perform batch writes. I'm not experienced enough with
> Flink's checkpointing or Parquet's file format to completely understand
> the issue. So my questions are as follows:
>
> 1) Is this possible in Flink in an exactly-once way? If not, is it
> possible in a way that _might_ cause duplicates during an error?
>
> 2) Is there another/better format to use than Parquet that offers
> compression and the ability to be queried by something like Drill or Impala?
>
> 3) Any further recommendations for solving the overall problem: ingesting
> syslogs and writing them to file(s) searchable by an SQL(-like)
> framework?
>
> Thanks!
>
> Bill-

--
Thanks
- Hao
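Returning to the 1.6+ suggestion above: FLINK-9750 introduced StreamingFileSink, whose bulk-format mode rolls the part file on every checkpoint, which is exactly what Parquet needs. A minimal sketch; the `LogEvent` POJO and `buildSyslogStream` helper are made-up placeholders for however you parse your Kafka stream, and it assumes the flink-parquet and parquet-avro dependencies are on the classpath:

```java
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.avro.ParquetAvroWriters;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Bulk formats only roll (and finalize) files on checkpoint, so
// checkpointing must be enabled; the interval controls file size/latency.
env.enableCheckpointing(60_000);

// Placeholder: your Kafka-source-plus-parsing pipeline producing LogEvent objects.
DataStream<LogEvent> parsed = buildSyslogStream(env);

StreamingFileSink<LogEvent> sink = StreamingFileSink
    .forBulkFormat(
        new Path("hdfs:///data/syslog-parquet"),
        // Derives an Avro schema from the POJO via reflection and writes Parquet.
        ParquetAvroWriters.forReflectRecord(LogEvent.class))
    .build();

parsed.addSink(sink);
```

In-progress files only become visible "finished" files when a checkpoint completes, which is what gives you the exactly-once behavior end to end, and the resulting Parquet files are directly queryable by Presto/Drill/Impala.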