Hi,
I actually have no experience running a Flink job on K8s against a
kerberized HDFS so please take what I'll say with a grain of salt.
The only thing you should need to do is to configure the path of your
keytab and possibly some other Kerberos settings. For that check out [1]
and [2].
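Concretely, in flink-conf.yaml that would be something along these lines (the keytab path and principal below are just placeholders; see [2] for the full list of keys, e.g. if Kafka is also kerberized):

security.kerberos.login.use-ticket-cache: false
security.kerberos.login.keytab: /path/to/your.keytab
security.kerberos.login.principal: your-user@YOUR.REALM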
I think in general this looks like the right approach, and Roman's
comments are correct as well.
Aljoscha
[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/security-kerberos.html#yarnmesos-mode
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#auth-with-external-systems
On 27.09.20 21:54, Khachatryan Roman wrote:
Hi,
1. Yes, StreamExecutionEnvironment.readFile can be used for files on HDFS
2. I think this is a valid concern. Besides that, there are plans to
deprecate the DataSet API [1]
4. Yes, the approach looks good
I'm pulling in Aljoscha for your 3rd question (and probably some
clarifications on others).
[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158866741
Regards,
Roman
On Fri, Sep 25, 2020 at 12:50 PM Damien Hawes <marley.ha...@gmail.com>
wrote:
Hi folks,
I've got the following use case, where I need to read data from HDFS and
publish the data to Kafka, such that it can be reprocessed by another job.
I've searched the web and read the docs, but this has turned up no
concrete examples or information on how this is achieved, or even whether
it's possible at all.
Further context:
1. Flink will be deployed to Kubernetes.
2. Kerberos is active on Hadoop.
3. The data is stored on HDFS as Avro.
4. I cannot install Flink on our Hadoop environment.
5. No stateful computations will be performed.
I've noticed that the flink-avro package provides a class called
AvroInputFormat<T>, with a nullable path field, and I think this is my go-to.
Apologies for the poor formatting ahead, but the code I have in mind looks
something like this:
StreamExecutionEnvironment env = ...;
AvroInputFormat<Source> inf = new AvroInputFormat<>(null, Source.class);
DataStreamSource<Source> stream = env.readFile(inf, "hdfs://path/to/data");
// rest, + publishing to Kafka using the FlinkKafkaProducer
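For the Kafka part, I imagine something roughly like this (the topic name, broker address, and serialization schema are placeholders on my side, not something I've confirmed):

Properties kafkaProps = new Properties();
kafkaProps.setProperty("bootstrap.servers", "kafka-broker:9092");

// AvroSerializationSchema comes from flink-avro; if it isn't available in the
// version we're on, a hand-rolled SerializationSchema<Source> would do instead.
FlinkKafkaProducer<Source> producer = new FlinkKafkaProducer<>(
        "replay-topic",
        AvroSerializationSchema.forSpecific(Source.class),
        kafkaProps);

stream.addSink(producer);
env.execute("hdfs-to-kafka-replay");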
My major questions and concerns are:
1. Is it possible to read from HDFS using the StreamExecutionEnvironment
object? I'm planning on using the DataStream API because of point (2) below.
2. Because Flink will be deployed on Kubernetes, I have a major concern
that if I were to use the DataSet API, once Flink completes and exits, the
pods will restart, causing unnecessary duplication of data. Is the pod
restart a valid concern?
3. Is there anything special I need to be worried about regarding Kerberos
in this instance? The keytab will be materialised on the pods upon startup.
4. Is this even a valid approach? The dataset I need to read and replay is
small (12 TB).
Any help, even in part will be appreciated.
Kind regards,
Damien