Hi Theo,

Regarding your first two questions: yes, Flink supports streaming writes to Hive, and Flink also supports automatically compacting small files during a streaming write [1]. (Hive and the filesystem connector share the same compaction mechanism; we forgot to add a dedicated document for Hive.)
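For reference, here is a minimal sketch of the filesystem side of this, i.e. a streaming sink table with compaction enabled (Flink 1.12, Table API). The table name, columns, HDFS path and the datagen source below are only illustrative; the option keys are the ones from the page linked under [1].

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class FilesystemCompactionSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Files are committed (and compacted) on checkpoints.
        env.enableCheckpointing(3 * 60 * 1000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // 'auto-compaction' merges the small files written within one checkpoint
        // into files of roughly 'compaction.file-size' before they become visible.
        tEnv.executeSql(
            "CREATE TABLE parquet_sink (" +
            "  user_id STRING," +
            "  msg STRING," +
            "  dt STRING" +
            ") PARTITIONED BY (dt) WITH (" +
            "  'connector' = 'filesystem'," +
            "  'path' = 'hdfs:///tmp/parquet_sink'," +    // placeholder path
            "  'format' = 'parquet'," +
            "  'auto-compaction' = 'true'," +
            "  'compaction.file-size' = '128MB'," +
            "  'sink.partition-commit.policy.kind' = 'success-file'" +
            ")");

        // A purely illustrative unbounded source so the streaming insert below runs.
        tEnv.executeSql(
            "CREATE TABLE datagen_source (" +
            "  user_id STRING," +
            "  msg STRING" +
            ") WITH ('connector' = 'datagen', 'rows-per-second' = '100')");

        tEnv.executeSql(
            "INSERT INTO parquet_sink " +
            "SELECT user_id, msg, DATE_FORMAT(LOCALTIMESTAMP, 'yyyy-MM-dd') FROM datagen_source");
    }
}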
And you don't need Hive transactional tables for this, because Flink compacts all the small files _before_ committing the files or partitions to Hive. From Hive's perspective, the written files are already large. I think this addresses most of your confusion; let me know if you have further questions. There is a rough sketch of such a streaming insert into a Hive table at the very end of this mail, below the quoted thread.

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/connectors/filesystem.html#file-compaction

Best,
Kurt

On Mon, Mar 15, 2021 at 5:05 PM Flavio Pompermaier <pomperma...@okkam.it> wrote:

> What about using Apache Hudi or Apache Iceberg?
>
> On Thu, Mar 4, 2021 at 10:15 AM Dawid Wysakowicz <dwysakow...@apache.org> wrote:
>
>> Hi,
>>
>> I know Jingsong worked on the Flink/Hive filesystem integration in the Table/SQL API. Maybe he can shed some light on your questions.
>>
>> Best,
>>
>> Dawid
>>
>> On 02/03/2021 21:03, Theo Diefenthal wrote:
>>
>> Hi there,
>>
>> Currently, I have a Flink 1.11 job which writes Parquet files via the StreamingFileSink to HDFS (simply using the DataStream API). I commit about every 3 minutes and thus have many small files in HDFS. Downstream, the generated table is consumed by Spark jobs and Impala queries. HDFS doesn't cope well with too many small files, and writing Parquet quickly while still wanting large files is a rather common problem; solutions were suggested recently on the mailing list [1] and in Flink Forward talks [2]. Cloudera also described two possible approaches in their blog posts [3], [4]. Mostly, it comes down to asynchronously compacting the many small files into larger ones, ideally non-blocking and in an occasionally running batch job.
>>
>> I am now about to implement something like the Cloudera blog suggests [4], but from Parquet to Parquet. To me it seems not straightforward but rather involved, especially as my data is partitioned by event time and I need the compaction to be non-blocking (my users query Impala and expect near-real-time performance for each query). When starting the work on that, I noticed that Hive already includes a compaction mechanism and that the Flink community has done a lot of work on Hive integration in the latest releases. Some of my questions are not directly related to Flink, but I guess many of you also have experience with Hive, and writing from Flink to Hive is rather common nowadays.
>>
>> I read online that Spark should integrate nicely with Hive tables, i.e. querying a Hive table has the same performance as querying the underlying HDFS files directly [5]. We also all know that Impala integrates nicely with Hive, so overall I expect that writing to Hive-managed tables instead of HDFS Parquet has no disadvantages for me.
>>
>> My questions:
>> 1. Can I use Flink to "streaming write" to Hive?
>> 2. For compaction, I need "transactional tables", and according to the Hive docs, transactional tables must be fully managed by Hive (i.e., they are not external). Does Flink support writing to those out of the box? (I only have Hive 2 available.)
>> 3. Does Flink use the "Hive Streaming Data Ingest" APIs?
>> 4. Do you see any downsides to writing to Hive compared to writing Parquet directly? (Especially in my use case with only Impala and Spark consumers.)
>> 5. Not Flink related: Have you ever experienced performance issues when using Hive transactional tables compared to writing Parquet directly? I guess there must be a reason why "transactional" is off by default in Hive?
>> I won't use any features except compaction, i.e. there are only streaming inserts, no updates and no deletes. (I delete only after a given retention period, and then always entire partitions.)
>>
>> Best regards
>> Theo
>>
>> [1] http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Streaming-data-to-parquet-td38029.html
>> [2] https://www.youtube.com/watch?v=eOQ2073iWt4
>> [3] https://blog.cloudera.com/how-to-ingest-and-query-fast-data-with-impala-without-kudu/
>> [4] https://blog.cloudera.com/transparent-hierarchical-storage-management-with-apache-kudu-and-impala/
>> [5] https://stackoverflow.com/questions/51190646/spark-dataset-on-hive-vs-parquet-file
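As mentioned above, here is a rough sketch of a streaming insert into a Hive table. The catalog name, hive-conf path, table and column names, and the source table are all just placeholders; the sink.partition-commit properties are the ones documented for Hive streaming writes in Flink 1.12, while setting 'auto-compaction' on the Hive table is only my reading of the note above that Hive and the filesystem connector share the same mechanism, so treat that property as an assumption.

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.SqlDialect;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveStreamingWriteSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Partitions and files are committed to the metastore on checkpoints.
        env.enableCheckpointing(3 * 60 * 1000);
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Register the Hive metastore; catalog name, database and conf dir are placeholders.
        HiveCatalog hiveCatalog = new HiveCatalog("myhive", "default", "/path/to/hive-conf");
        tEnv.registerCatalog("myhive", hiveCatalog);
        tEnv.useCatalog("myhive");

        // Create the target table using the Hive dialect. The partition-commit
        // properties follow the Flink 1.12 Hive streaming-write documentation;
        // 'auto-compaction' is assumed to behave like the filesystem connector option.
        tEnv.getConfig().setSqlDialect(SqlDialect.HIVE);
        tEnv.executeSql(
            "CREATE TABLE IF NOT EXISTS events_hive (user_id STRING, msg STRING) " +
            "PARTITIONED BY (dt STRING) STORED AS PARQUET TBLPROPERTIES (" +
            "  'sink.partition-commit.trigger'='process-time'," +
            "  'sink.partition-commit.delay'='0 s'," +
            "  'sink.partition-commit.policy.kind'='metastore,success-file'," +
            "  'auto-compaction'='true'" +
            ")");

        // Streaming insert from an assumed, already-registered source table.
        tEnv.getConfig().setSqlDialect(SqlDialect.DEFAULT);
        tEnv.executeSql(
            "INSERT INTO events_hive " +
            "SELECT user_id, msg, DATE_FORMAT(event_time, 'yyyy-MM-dd') FROM events_source");
    }
}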