Hi,
I actually have no experience running a Flink job on K8s against a
kerberized HDFS so please take what I'll say with a grain of salt.
The only thing you should need to do is to configure the path of your
keytab and possibly some other Kerberos settings. For that check out [1]
and [2].
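Concretely, in flink-conf.yaml that would be something along these lines (the keytab path and principal below are just placeholders; see [2] for the full list of keys, e.g. if Kafka is also kerberized):

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/your.keytab
security.kerberos.login.principal: your-user@YOUR.REALM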
I think in general this looks like the right approach, and Roman's
comments are correct as well.
Aljoscha
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/security-kerberos.html#yarnmesos-mode
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#auth-with-external-systems
On 27.09.20 21:54, Khachatryan Roman wrote:
Hi,
1. Yes, StreamExecutionEnvironment.readFile can be used for files on HDFS
2. I think this is a valid concern. Besides that, there are plans to
deprecate the DataSet API [1]
4. Yes, the approach looks good
I'm pulling in Aljoscha for your 3rd question (and probably some
clarifications on others).
[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
Regards,
Roman
On Fri, Sep 25, 2020 at 12:50 PM Damien Hawes <marley.ha...@gmail.com>
wrote:
Hi folks,
I've got the following use case, where I need to read data from HDFS and
publish the data to Kafka, such that it can be reprocessed by another job.
I've searched the web and read the docs, but this has turned up no
concrete examples or information on how this is achieved, or even whether
it's possible at all.
Further context:
1. Flink will be deployed to Kubernetes.
2. Kerberos is active on Hadoop.
3. The data is stored on HDFS as Avro.
4. I cannot install Flink on our Hadoop environment.
5. No stateful computations will be performed.
I've noticed that the flink-avro package provides a class called
AvroInputFormat<T>, with a nullable path field, and I think this is my go-to.
Apologies for the poor formatting ahead, but the code I have in mind looks
something like this:
StreamExecutionEnvironment env = ...;
AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class);
DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
// rest, + publishing to Kafka using the FlinkKafkaProducer
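For the Kafka part, I imagine something roughly like this (the topic name, broker address, and serialization schema are placeholders on my side, not something I've confirmed):

Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka-broker:9092");

// AvroSerializationSchema comes from flink-avro; if it isn't available in the
// version we're on, a hand-rolled SerializationSchema<Source> would do instead.
FlinkKafkaProducer<Source> producer = new FlinkKafkaProducer<>(
        "replay-topic",
        AvroSerializationSchema.forSpecific(Source.class),
        kafkaProps);

stream.addSink(producer);
env.execute("hdfs-to-kafka-replay");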
My major questions and concerns are:
1. Is it possible to read from HDFS using the StreamExecutionEnvironment
object? I'm planning on using the DataStream API because of point (2) below.
2. Because Flink will be deployed on Kubernetes, I have a major concern
that if I were to use the DataSet API, once Flink completes and exits, the
pods will restart, causing unnecessary duplication of data. Is the pod
restart a valid concern?
3. Is there anything special I need to be worried about regarding Kerberos
in this instance? The keytab will be materialised on the pods upon startup.
4. Is this even a valid approach? The dataset I need to read and replay is
small (12 TB).
Any help, even in part will be appreciated.
Kind regards,
Damien