Great, thank you so much for your responses. It all makes sense now. :)

On Mon, Jan 30, 2023 at 10:41 PM Dian Fu <dian0511...@gmail.com> wrote:
> >> What is the reason for including
> opt/python/{pyflink.zip,cloudpickle.zip,py4j.zip} in the base
> distribution then? Oh, a guess: to make it easier for TaskManagers to run
> pyflink without having pyflink installed themselves? Somehow I'd guess
> this wouldn't work tho; I'd assume TaskManagers would also need some python
> transitive dependencies, e.g. google protobuf.
>
> It has some historical reasons. In the first version (1.9.x), which did not
> provide Python UDF support, it was not necessary to install PyFlink on the
> TaskManager nodes. Since 1.10, which supports Python UDFs, users have to
> install PyFlink on the TaskManager nodes as there are many transitive
> dependencies, e.g. Apache Beam, protobuf, pandas, etc. However, we have not
> removed these packages as they are still useful for the client node, which is
> responsible for compiling jobs (it's not necessary to install PyFlink on the
> client node).
>
> >> Since we're building our own Docker image, I'm going the other way
> around: just install pyflink, and symlink /opt/flink ->
> /usr/lib/python3.7/dist-packages/pyflink. So far so good, but I'm
> worried that something will be fishy when trying to run JVM apps via
> pyflink.
>
> Good idea! The PyFlink package contains everything necessary to run JVM
> apps, so I think you could just try this way.
>
> Regards,
> Dian
>
> On Mon, Jan 30, 2023 at 9:58 PM Andrew Otto <o...@wikimedia.org> wrote:
>
>> Thanks Dian!
>>
>> > >> Is using pyflink from the flink distribution tarball (without pip)
>> not a supported way to use pyflink?
>> > You are right.
>>
>> What is the reason for including
>> opt/python/{pyflink.zip,cloudpickle.zip,py4j.zip} in the base
>> distribution then? Oh, a guess: to make it easier for TaskManagers to run
>> pyflink without having pyflink installed themselves? Somehow I'd guess
>> this wouldn't work tho; I'd assume TaskManagers would also need some python
>> transitive dependencies, e.g. google protobuf.
>>
>> > you could remove the JAR packages located under
>> /usr/local/lib/python3.7/dist-packages/pyflink/lib manually after `pip
>> install apache-flink`
>>
>> Since we're building our own Docker image, I'm going the other way
>> around: just install pyflink, and symlink /opt/flink ->
>> /usr/lib/python3.7/dist-packages/pyflink. So far so good, but I'm worried
>> that something will be fishy when trying to run JVM apps via pyflink.
>>
>> -Ao
>>
>> On Sun, Jan 29, 2023 at 1:43 AM Dian Fu <dian0511...@gmail.com> wrote:
>>
>>> Hi Andrew,
>>>
>>> >> By pip installing apache-flink, this docker image will have the flink
>>> distro installed at /opt/flink and FLINK_HOME set to /opt/flink
>>> <https://github.com/apache/flink-docker/blob/master/1.16/scala_2.12-java11-ubuntu/Dockerfile>.
>>> BUT ALSO flink lib jars will be installed at e.g.
>>> /usr/local/lib/python3.7/dist-packages/pyflink/lib!
>>> So, by following those instructions, flink is effectively installed
>>> twice into the docker image.
>>>
>>> Yes, your understanding is correct. The base image `flink:1.15.2`
>>> doesn't include PyFlink and so you need to build a custom image if you want
>>> to use PyFlink. Regarding the JAR packages which are installed twice,
>>> you could remove the JAR packages located under
>>> /usr/local/lib/python3.7/dist-packages/pyflink/lib manually after `pip
>>> install apache-flink`. It will use the JAR packages located under
>>> $FLINK_HOME/lib.
>>>
>>> >> Is using pyflink from the flink distribution tarball (without pip)
>>> not a supported way to use pyflink?
>>> You are right.
>>>
>>> Regards,
>>> Dian
>>>
>>> On Thu, Jan 26, 2023 at 11:12 PM Andrew Otto <o...@wikimedia.org> wrote:
>>>
>>>> Ah, oops and my original email had a typo:
>>>> > Some python dependencies are not included in the flink distribution
>>>> tarballs: cloudpickle, py4j and pyflink are in opt/python.
>>>>
>>>> Should read:
>>>> > Some python dependencies ARE included in the flink distribution
>>>> tarballs: cloudpickle, py4j and pyflink are in opt/python.
>>>>
>>>> On Thu, Jan 26, 2023 at 10:10 AM Andrew Otto <o...@wikimedia.org>
>>>> wrote:
>>>>
>>>>> Let me ask a related question:
>>>>>
>>>>> We are building our own base Flink docker image. We will be deploying
>>>>> both JVM and python apps via flink-kubernetes-operator.
>>>>>
>>>>> Is there any reason not to install Flink in this image via `pip
>>>>> install apache-flink` and use it for JVM apps?
>>>>>
>>>>> -Andrew Otto
>>>>> Wikimedia Foundation
>>>>>
>>>>> On Tue, Jan 24, 2023 at 4:26 PM Andrew Otto <o...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I'm having quite a bit of trouble running pyflink from the default
>>>>>> flink distribution tarballs. I'd expect the python examples to work as
>>>>>> long as python is installed, and we've got the distribution. Some python
>>>>>> dependencies are not included in the flink distribution tarballs:
>>>>>> cloudpickle, py4j and pyflink are in opt/python. Others are not, e.g.
>>>>>> protobuf.
>>>>>>
>>>>>> Now that I'm looking, I see that the pyflink installation
>>>>>> instructions
>>>>>> <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/dev/python/installation/>
>>>>>> are to install via pip.
>>>>>>
>>>>>> I'm doing this in Docker for use with the flink-kubernetes-operator.
>>>>>> In the Using Flink Python on Docker
>>>>>> <https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/docker/#using-flink-python-on-docker>
>>>>>> instructions, there is a pip3 install apache-flink step. I find this
>>>>>> strange, since I'd expect the 'FROM flink:1.15.2' part to be sufficient.
>>>>>>
>>>>>> By pip installing apache-flink, this docker image will have the flink
>>>>>> distro installed at /opt/flink and FLINK_HOME set to /opt/flink
>>>>>> <https://github.com/apache/flink-docker/blob/master/1.16/scala_2.12-java11-ubuntu/Dockerfile>.
>>>>>> BUT ALSO flink lib jars will be installed at e.g.
>>>>>> /usr/local/lib/python3.7/dist-packages/pyflink/lib!
>>>>>> So, by following those instructions, flink is effectively installed
>>>>>> twice into the docker image.
>>>>>>
>>>>>> Am I correct or am I missing something?
>>>>>>
>>>>>> Is using pyflink from the flink distribution tarball (without pip)
>>>>>> not a supported way to use pyflink?
>>>>>>
>>>>>> Thanks!
>>>>>> -Andrew Otto
>>>>>> Wikimedia Foundation
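
For anyone who finds this thread later: the cleanup suggested above (removing the JARs under pyflink/lib after `pip install apache-flink` so that the ones under $FLINK_HOME/lib are used) might be scripted roughly as below. This is only a sketch, not something from the Flink docs. The helper names (`pyflink_lib_dir`, `duplicate_jars`, `remove_duplicates`) are made up for illustration, and matching JARs by file name is an assumption that only holds when the pip package and the distribution are the same Flink version. It defaults to a dry run.

```python
"""Sketch: find JARs that `pip install apache-flink` duplicated from $FLINK_HOME/lib."""
import importlib.util
import os
from pathlib import Path
from typing import List, Optional


def pyflink_lib_dir() -> Optional[Path]:
    """Return the lib/ directory of the pip-installed pyflink package, or None."""
    spec = importlib.util.find_spec("pyflink")
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0]) / "lib"


def duplicate_jars(pyflink_lib: Path, flink_home_lib: Path) -> List[Path]:
    """List JARs in pyflink_lib whose file name also exists in flink_home_lib.

    Assumption: same file name means same JAR (i.e. matching Flink versions).
    """
    shipped = {p.name for p in flink_home_lib.glob("*.jar")}
    return sorted(p for p in pyflink_lib.glob("*.jar") if p.name in shipped)


def remove_duplicates(pyflink_lib: Path, flink_home_lib: Path,
                      dry_run: bool = True) -> List[Path]:
    """Return the duplicated JARs; delete them only when dry_run is False."""
    dupes = duplicate_jars(pyflink_lib, flink_home_lib)
    if not dry_run:
        for jar in dupes:
            jar.unlink()
    return dupes


if __name__ == "__main__":
    lib = pyflink_lib_dir()
    home = os.environ.get("FLINK_HOME")
    if lib is None or home is None:
        print("pyflink or FLINK_HOME not found; nothing to do")
    else:
        # Dry run: only report what would be removed.
        for jar in remove_duplicates(lib, Path(home) / "lib"):
            print("duplicate:", jar)
```

Run inside the image after the pip step, this only prints the duplicated JARs; passing `dry_run=False` to `remove_duplicates` deletes them. The symlink approach (/opt/flink pointing at the pyflink package directory) avoids the duplication from the other direction and would not need this.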