Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Holden Karau
pip installing pyspark like that probably isn't a great idea, since there isn't a version tagged to it. Probably better to install from the local files copied in than potentially from pypi. Might be able to install in -e mode, where it'll do symlinks to save space; I'm not sure. On Tue, Aug 17, 2021
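The approach Holden describes could look roughly like this as a Dockerfile sketch (the Spark distribution path and Python base image are assumptions, not from the thread):

```dockerfile
# Sketch only: install the pyspark package from the Spark files already
# copied into the image, so the Python package version always matches the
# bundled jars, instead of pulling an arbitrary version from PyPI.
FROM python:3.7-slim
COPY spark-3.1.1-bin-hadoop3.2 /opt/spark
# -e records a link back to /opt/spark/python rather than copying the
# sources into site-packages, which may save space as suggested above.
RUN pip install --no-cache-dir -e /opt/spark/python
```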

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
Thanks Andrew, that was helpful. Step 10/23 : RUN pip install pyyaml numpy cx_Oracle pyspark --no-cache-dir And the reduction in size is considerable: 1.75GB vs 2.19GB. Note that the original run has now been invalidated. REPOSITORY TAG IMAGE ID

Re: Nabble archive is down

2021-08-17 Thread Sean Owen
Oh duh, right, much better idea! On Tue, Aug 17, 2021 at 2:56 PM Micah Kornfield wrote: > https://lists.apache.org/list.html?u...@spark.apache.org should be > searchable (although the UI is a little clunky). > > On Tue, Aug 17, 2021 at 12:52 PM Sean Owen wrote: > >> If the links are down and

Re: Observer Namenode and Committer Algorithm V1

2021-08-17 Thread Erik Krogen
Hi Adam, Thanks for this great writeup of the issue. We (LinkedIn) also operate Observer NameNodes, and have observed the same issues, but have not yet gotten around to implementing a proper fix. To add a bit of context from our side, there is at least one other place besides the committer v1

Re: Nabble archive is down

2021-08-17 Thread Micah Kornfield
https://lists.apache.org/list.html?u...@spark.apache.org should be searchable (although the UI is a little clunky). On Tue, Aug 17, 2021 at 12:52 PM Sean Owen wrote: > If the links are down and not evidently coming back, yeah let's change any > website links. Probably best to depend on ASF

Re: Nabble archive is down

2021-08-17 Thread Sean Owen
If the links are down and not evidently coming back, yeah let's change any website links. Probably best to depend on ASF resources foremost, but, the ASF archive isn't searchable: https://mail-archives.apache.org/mod_mbox/spark-user/ What about things like

Nabble archive is down

2021-08-17 Thread Maciej
Hi everyone, It seems like Nabble is downsizing, and nX.nabble.com servers, including the one with the Spark user and dev lists, are already down. Do we plan to ask them to preserve the content (I haven't seen any related requests on their support forum), or should we update website links to point to the ASF

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
Hi Mich, By default, pip caches downloaded binaries to somewhere like $HOME/.cache/pip. So after doing any "pip install", you'll want to either delete that directory, or pass the "--no-cache-dir" option to pip to prevent the downloaded binaries from being added to the image. HTH Andrew On Tue,
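Both options Andrew mentions can be sketched as Dockerfile steps (the package names here are just examples):

```dockerfile
# Option 1: tell pip not to write a download cache at all.
RUN pip install --no-cache-dir pyyaml numpy

# Option 2: install normally, but delete the cache in the SAME RUN step.
# A separate, later "RUN rm -rf /root/.cache/pip" would not shrink the
# image, because the earlier layer containing the cache has already been
# committed.
RUN pip install pyyaml numpy && rm -rf /root/.cache/pip
```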

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
Hi Andrew, Can you please elaborate on blowing the pip cache before committing the layer? Thanks, Mich On Tue, 17 Aug 2021 at 16:57, Andrew Melo wrote: > Silly Q, did you blow away the pip cache before committing the layer? That > always trips me up. > > Cheers > Andrew > > On Tue, Aug 17, 2021

Observer Namenode and Committer Algorithm V1

2021-08-17 Thread Adam Binford
Hi, We ran into an interesting issue that I wanted to share as well as get thoughts on if anything should be done about this. We run our own Hadoop cluster and recently deployed an Observer Namenode to take some burden off of our Active Namenode. We mostly use Delta Lake as our format, and
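For context, HDFS clients opt into observer reads through an observer-aware failover proxy provider on the client side. A minimal client-config sketch (the nameservice id `mycluster` is illustrative, not from the thread):

```xml
<!-- hdfs-site.xml on the client: route read calls to the Observer NameNode
     instead of the Active NameNode. -->
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ObserverReadProxyProvider</value>
</property>
```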

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Andrew Melo
Silly Q, did you blow away the pip cache before committing the layer? That always trips me up. Cheers Andrew On Tue, Aug 17, 2021 at 10:56 Mich Talebzadeh wrote: > With no additional python packages etc we get 1.4GB compared to 2.19GB > before > > REPOSITORY TAG

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
Examples: *docker images*
REPOSITORY       TAG                                  IMAGE ID       CREATED          SIZE
spark/spark-py   3.1.1_sparkpy_3.7-scala_2.12-java8   ba3c17bc9337   2 minutes ago    2.19GB
spark            3.1.1-scala_2.12-java11              4595c4e78879   18 minutes ago

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
3.1.2_sparkpy_3.7-scala_2.12-java11 3.1.2_sparkR_3.6-scala_2.12-java11 Yes, let us go with that, and remember that we can change the tags anytime. The accompanying release note should detail what is inside the downloaded image. +1 from me view my Linkedin profile

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Maciej
You're right, but with native dependencies (this is the case for the packages I've mentioned before) we have to bundle complete environments. It is doable, but if you do that, you're actually better off with a base image. I don't insist this is something we have to address right now, just

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
Of course with PySpark, there is the option of putting your packages in gz format and sending them at spark-submit time: --conf "spark.yarn.dist.archives"=pyspark_venv.tar.gz#environment \ However, in the Kubernetes cluster that file is going to be fairly massive and will take time to unzip and
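The workflow Mich refers to can be sketched as follows (package names and paths are illustrative; `venv-pack` is one common way to produce such an archive, though the thread does not name a tool):

```shell
# Build and pack a virtualenv, then ship it with the job at submit time
# instead of baking the packages into the Docker image.
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pyyaml numpy venv-pack
venv-pack -o pyspark_venv.tar.gz

spark-submit \
  --conf "spark.yarn.dist.archives=pyspark_venv.tar.gz#environment" \
  --conf "spark.pyspark.python=./environment/bin/python" \
  my_job.py
```

The `#environment` suffix controls the directory name the archive is unpacked into on the executors, which is why `spark.pyspark.python` points at `./environment/bin/python`.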

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
An interesting point. Do we have a repository for current containers? I am not aware of one.

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Holden Karau
I as well think the largest use case of docker images would be on Kubernetes. While I have nothing against us adding more varieties I think it’s important for us to get this started with our current containers, so I’ll do that but let’s certainly continue exploring improvements after that. On

Re: Time to start publishing Spark Docker Images?

2021-08-17 Thread Mich Talebzadeh
Thanks for the notes all. I think we ought to consider what general Docker usage looks like. A Docker image is by definition a self-contained, general-purpose entity providing the Spark service at the common denominator. Some Docker images, like the one for Jenkins, are far simpler to build as they have less