Hi,

This is a bit of old hat, but it is worth getting opinions on.

Current options that I believe apply are:


   1. Installing them individually via pip in the Docker build process
   2. Installing them together via pip in the build process via
   requirements.txt
   3. Installing them to a volume and adding the volume to the PYTHONPATH
   (see the sketch after this list)
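
As a rough sketch of option 3 (the mount point /mnt/pylibs is just an
assumption, not a recommended path), the packages can be pre-installed
onto the shared volume with pip's --target option and the mount point
prepended to PYTHONPATH for the driver and executors:

# install the pinned packages onto the shared volume
pip install --no-cache-dir --target /mnt/pylibs -r requirements.txt
# make the volume visible to Python in the Spark containers
export PYTHONPATH=/mnt/pylibs:$PYTHONPATH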

From my experience, there is a case for installing them at Docker build
time:

RUN pip install pyyaml --no-cache-dir
RUN pip install --no-cache-dir -r requirements.txt
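
For illustration, a minimal Dockerfile along those lines might look like
the following (the base image name and tag are assumptions; use whatever
Spark image you normally build from, provided it ships Python and pip):

# hypothetical Spark base image; substitute your own
FROM apache/spark-py:latest
# copy the pinned package list into the image
COPY requirements.txt /tmp/requirements.txt
# install everything in one layer, without keeping the pip cache
RUN pip install --no-cache-dir -r /tmp/requirements.txt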

Alternatively, one can use the following with spark-submit:

--archives pyspark_venv.tar.gz#environment
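
For context, a minimal sketch of how that archive is typically produced
and wired up, along the lines of the venv-pack approach in the Spark
documentation (the venv name and my_app.py are assumptions):

# build and populate a virtual environment, then pack it
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install -r requirements.txt venv-pack
venv-pack -o pyspark_venv.tar.gz
# point the executors at the Python inside the unpacked archive
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment my_app.py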

The problem with archives, as I have noticed, is that unzipping and
untarring the packages takes considerable time and sometimes spark-submit
hangs. With packages baked into the Docker image, the package versions may
get out of date, although this has not been an issue for me.

So there are pros and cons either way. However, with a CI/CD pipeline we
can rebuild the Docker images more frequently if needed.

Docker images have the drawback that the more packages you include, the
larger the image, and of course pulling it all from the container registry
(ECR, GCR etc.) takes more time and impacts deployment time. I still
favour options 1 or 2 above.

Thanks

Mich Talebzadeh,
Solutions Architect/Engineering Lead
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.
