The problem was a misconfiguration of the initContainer which would copy my
artifacts from s3 to an ephemeral volume.  This caused the task manager to
get started for a bit and then to be shut down.  It was hard to get logging
about this since the pods were gone before I could get logging from it.  I
chalk all that up to just me lacking a bit of experience with k8s.

That being said... It's all working now and I documented the deployment
over here:

https://hop.apache.org/manual/next/pipeline/beam/flink-k8s-operator-running-hop-pipeline.html

A big thank you to everyone that helped me out!

Cheers,
Matt

On Mon, Jun 27, 2022 at 4:59 AM Yang Wang <danrtsey...@gmail.com> wrote:

> Could you please share the JobManager logs of failed deployment? It will
> also help a lot if you could show the pending pod status via "kubectl
> describe <pod_name>".
>
> Given that the current Flink Kubernetes Operator is built on top of native
> K8s integration[1], the Flink ResourceManager should allocate enough
> TaskManager pods automatically.
> We need to find out what is wrong via the logs. Maybe the service account
> or taint or something else.
>
>
> [1]. https://flink.apache.org/2021/02/10/native-k8s-with-ha.html
>
>
> Best,
> Yang
>
> Matt Casters <matt.cast...@neotechnology.com> 于2022年6月24日周五 23:48写道:
>
>> Yes of-course.  I already feel a bit less intelligent for having asked
>> the question ;-)
>>
>> The status now is that I managed to have it all puzzled together.
>> Copying the files from s3 to an ephemeral volume takes all of 2 seconds so
>> it's really not an issue.  The cluster starts and our fat jar and Apache
>> Hop MainBeam class is found and started.
>>
>> The only thing that remains is figuring out how to configure the Flink
>> cluster itself.  I have a couple of m5.large ec2 instances in a node group
>> on EKS and I set taskmanager.numberOfTaskSlots to "4".  However, the tasks
>> in the pipeline can't seem to find resources to start.
>>
>> Caused by:
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Slot request bulk is not fulfillable! Could not allocate the required slot
>> within slot request timeout
>>
>> Parallelism was set to 1 for the runner and there are only 2 tasks in my
>> first Beam pipeline so it should be simple enough but it just times out.
>>
>> Next step for me is to document the result which will end up on
>> hop.apache.org.   I'll probably also want to demo this in Austin at the
>> upcoming Beam summit.
>>
>> Thanks a lot for your time and help so far!
>>
>> Cheers,
>> Matt
>>
>>

Reply via email to