Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan wrote:
> We can start by getting a PR going perhaps, and start augmenting the
> integration testing to ensure that there are no surprises - with/without
> credentials, accessing GCS, S3 etc as well.
> When we get enough confidence and test coverage, let's merge this in.
> Does that sound like a reasonable path forward?

I think it's beneficial to separate this into two separate things as far as discussion goes:

- using spark-submit: the code should definitely be starting the driver using spark-submit, and potentially the executor using spark-class.
- separately, we can decide on whether to keep or remove init containers.

Unfortunately, code-wise, those are not separate. If you get rid of init containers, my current p.o.c. has most of the needed changes (only lightly tested). But if you keep init containers, you'll need to mess with the configuration so that spark-submit never sees spark.jars / spark.files, so it doesn't trigger its dependency download code. (YARN does something similar, btw.) That will surely mean different changes in the current k8s code (which I wanted to double check anyway because I remember seeing some oddities related to those configs in the logs).

To comment on one point made by Andrew:
> there's almost a parallel here with spark.yarn.archive, where that configures
> the cluster (YARN) to do distribution pre-runtime

That's more of a parallel to the docker image; spark.yarn.archive points to a jar file with Spark jars in it so that YARN can make Spark available to the driver / executors running in the cluster. Like the docker image, you could include other stuff that is not really part of standard Spark in that archive too, or even not have Spark at all there, if you want things to just fail. :-)

--
Marcelo

To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
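[Editor's note: a minimal Python sketch of the configuration shuffling Marcelo describes, i.e. hiding spark.jars / spark.files from spark-submit so its download code never triggers. The function and the split are illustrative assumptions, not actual Spark code.]

```python
# Hypothetical sketch, not Spark code: if init containers keep handling the
# downloads, the submission client has to hide these keys from spark-submit
# so its own dependency download code never triggers (YARN reportedly does
# something similar by moving them into internal configs).

DOWNLOAD_KEYS = ("spark.jars", "spark.files")  # keys spark-submit acts on

def hide_download_configs(conf):
    """Split a conf dict into what spark-submit may see and what the
    init container would handle instead."""
    visible = {k: v for k, v in conf.items() if k not in DOWNLOAD_KEYS}
    hidden = {k: v for k, v in conf.items() if k in DOWNLOAD_KEYS}
    return visible, hidden

visible, hidden = hide_download_configs({
    "spark.jars": "https://example.com/app.jar",  # hypothetical remote jar
    "spark.executor.memory": "2g",
})
print(visible)  # {'spark.executor.memory': '2g'}
print(hidden)   # {'spark.jars': 'https://example.com/app.jar'}
```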
Re: Kubernetes: why use init containers?
It seems we have two standard practices for resource distribution in place here:

- the Spark way is that the application (Spark) distributes the resources *during* app execution, and does this by exposing files/jars on an http server on the driver (or pre-staged elsewhere), and executors downloading from that location (driver or remote)
- the Kubernetes way is that the cluster manager (Kubernetes) distributes the resources *before* app execution, and does this primarily via docker images, and secondarily through init containers for non-image resources. I'd imagine a motivation for this choice on k8s' part is immutability of the application at runtime.

When the Spark and K8s standard practices are in conflict (as they seem to be here), which convention should be followed? Looking at the Spark-on-YARN integration, there's almost a parallel here with spark.yarn.archive, where that configures the cluster (YARN) to do distribution pre-runtime instead of the application mid-runtime.

Based purely on the lines-of-code removal, right now I lean towards eliminating init containers. It doesn't seem like credential segregation between init container and main pod container is that valuable right now, and the retryability could/should be in all of Spark's cluster managers, not just k8s. So I support Anirudh's suggestion to move towards bringing the change demonstrated in Marcelo's POC into master.

On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan <ramanath...@google.com.invalid> wrote:
> Thanks for this discussion everyone. It has been very useful in getting an
> overall understanding here.
> I think in general, consensus is that this change doesn't introduce
> behavioral changes, and it's definitely an advantage to reuse the
> constructs that Spark provides to us.
>
> Moving on to a different question here - of pushing this through to Spark.
> The init containers have been tested over the past two Spark releases by
> external users and integration testing - and this would be a fundamental
> change to that behavior.
> We should work on getting enough test coverage and confidence here.
>
> We can start by getting a PR going perhaps, and start augmenting the
> integration testing to ensure that there are no surprises - with/without
> credentials, accessing GCS, S3 etc as well.
> When we get enough confidence and test coverage, let's merge this in.
> Does that sound like a reasonable path forward?
>
> On Wed, Jan 10, 2018 at 2:53 PM, Marcelo Vanzin wrote:
>> On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah wrote:
>>> those sidecars may perform side effects that are undesirable if the
>>> main Spark application failed because dependencies weren’t available
>>
>> If the contract is that the Spark driver pod does not have an init
>> container, and the driver handles its own dependencies, then by
>> definition that situation cannot exist.
>>
>> --
>> Marcelo
>
> --
> Anirudh Ramanathan
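[Editor's note: as a concrete illustration of the "Kubernetes way" described above, an init container is essentially a short program that localizes dependencies into a volume shared with the main container and fails the pod if it can't. The following Python sketch of that contract is illustrative only, not Spark code; paths and behavior are assumptions.]

```python
# Illustrative sketch of the init-container contract, not actual Spark code:
# fetch every dependency into a directory shared with the main container and
# report failure so Kubernetes never starts the main container.
import urllib.request
from pathlib import Path

def localize(deps, target_dir):
    """Return 0 if all deps were downloaded into target_dir, 1 otherwise.
    A non-zero exit marks the init container as failed; the pod's
    restartPolicy then decides whether it gets retried."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in deps:
        try:
            urllib.request.urlretrieve(url, target / url.rsplit("/", 1)[-1])
        except OSError:
            return 1
    return 0
```

The main container then only ever sees a fully populated dependency directory, which is the separation-of-concerns guarantee debated in this thread.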
Re: Kubernetes: why use init containers?
Thanks for this discussion everyone. It has been very useful in getting an overall understanding here. I think in general, consensus is that this change doesn't introduce behavioral changes, and it's definitely an advantage to reuse the constructs that Spark provides to us.

Moving on to a different question here - of pushing this through to Spark. The init containers have been tested over the past two Spark releases by external users and integration testing - and this would be a fundamental change to that behavior. We should work on getting enough test coverage and confidence here.

We can start by getting a PR going perhaps, and start augmenting the integration testing to ensure that there are no surprises - with/without credentials, accessing GCS, S3 etc as well. When we get enough confidence and test coverage, let's merge this in. Does that sound like a reasonable path forward?

On Wed, Jan 10, 2018 at 2:53 PM, Marcelo Vanzin wrote:
> On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah wrote:
>> those sidecars may perform side effects that are undesirable if the main
>> Spark application failed because dependencies weren’t available
>
> If the contract is that the Spark driver pod does not have an init
> container, and the driver handles its own dependencies, then by
> definition that situation cannot exist.
>
> --
> Marcelo

--
Anirudh Ramanathan
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah wrote:
> those sidecars may perform side effects that are undesirable if the main
> Spark application failed because dependencies weren’t available

If the contract is that the Spark driver pod does not have an init container, and the driver handles its own dependencies, then by definition that situation cannot exist.

--
Marcelo
Re: Kubernetes: why use init containers?
With regards to separation of concerns, there’s a fringe use case here – if more than one main container is on the pod, then none of them will run if the init-containers fail. A user can have a Pod Preset that attaches more sidecar containers to the driver and/or executors. In that case, those sidecars may perform side effects that are undesirable if the main Spark application failed because dependencies weren’t available. Using the init-container to localize the dependencies will prevent any of these sidecars from executing at all if the dependencies can’t be fetched.

It’s definitely a niche use case – I’m not sure how often pod presets are used in practice - but it’s an example to illustrate why the separation of concerns can be beneficial.

-Matt Cheah

On 1/10/18, 2:36 PM, "Marcelo Vanzin" wrote:

On Wed, Jan 10, 2018 at 2:30 PM, Yinan Li wrote:
> 1. Retries of init-containers are automatically supported by k8s through pod
> restart policies. For this point, sorry I'm not sure how spark-submit
> achieves this.

Great, add that feature to spark-submit, everybody benefits, not just k8s.

> 2. The ability to use credentials that are not shared with the main
> containers.

Not sure what that achieves.

> 3. Not only the user code, but Spark internal code like Executor won't be
> run if the init-container fails.

Not sure what that achieves. Executor will fail if dependency download fails, Spark driver will recover (and start a new executor if needed).

> 4. Easier to build tooling around k8s events/status of the init-container in
> case of failures as it's doing exactly one thing: downloading dependencies.

Again, I don't see what is all this hoopla about fine grained control of dependency downloads. Spark solved this years ago for Spark applications. Don't reinvent the wheel.

--
Marcelo
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 2:30 PM, Yinan Li wrote:
> 1. Retries of init-containers are automatically supported by k8s through pod
> restart policies. For this point, sorry I'm not sure how spark-submit
> achieves this.

Great, add that feature to spark-submit, everybody benefits, not just k8s.

> 2. The ability to use credentials that are not shared with the main
> containers.

Not sure what that achieves.

> 3. Not only the user code, but Spark internal code like Executor won't be
> run if the init-container fails.

Not sure what that achieves. Executor will fail if dependency download fails, Spark driver will recover (and start a new executor if needed).

> 4. Easier to build tooling around k8s events/status of the init-container in
> case of failures as it's doing exactly one thing: downloading dependencies.

Again, I don't see what is all this hoopla about fine grained control of dependency downloads. Spark solved this years ago for Spark applications. Don't reinvent the wheel.

--
Marcelo
Re: Kubernetes: why use init containers?
> Sorry, but what are those again? So far all the benefits are already
> provided by spark-submit...

1. Retries of init-containers are automatically supported by k8s through pod restart policies. For this point, sorry I'm not sure how spark-submit achieves this.
2. The ability to use credentials that are not shared with the main containers.
3. Not only the user code, but Spark internal code like Executor won't be run if the init-container fails.
4. Easier to build tooling around k8s events/status of the init-container in case of failures as it's doing exactly one thing: downloading dependencies.

There could be others that I'm not aware of.

On Wed, Jan 10, 2018 at 2:21 PM, Marcelo Vanzin wrote:
> On Wed, Jan 10, 2018 at 2:16 PM, Yinan Li wrote:
>> but we can not rule out the benefits init-containers bring either.
>
> Sorry, but what are those again? So far all the benefits are already
> provided by spark-submit...
>
>> Again, I would suggest we look at this more thoroughly post 2.3.
>
> Actually, one of the reasons why I brought this up is that we should
> remove init containers from 2.3 unless they're really required for
> something.
>
> Simplifying the code is not the only issue. The init container support
> introduces a whole lot of user-visible behavior - like config options
> and the execution of a completely separate container that the user can
> customize. If removed later, that could be considered a breaking
> change.
>
> So if we ship 2.3 without init containers and add them later if
> needed, it's a much better world than flipping that around.
>
> --
> Marcelo
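[Editor's note: on point 1 above, what k8s provides via restartPolicy is automatic re-execution of a failed download step. The Python sketch below shows the equivalent retry logic, roughly what adding that feature to spark-submit's download path could look like; the names are illustrative assumptions, not real Spark APIs.]

```python
import time

def fetch_with_retries(fetch, attempts=3, backoff_s=1.0):
    """Re-run a failing download callable, roughly what a pod
    restartPolicy of OnFailure does for a failed init container.
    `fetch` is any zero-argument callable that raises OSError on failure."""
    for i in range(attempts):
        try:
            return fetch()
        except OSError:
            if i == attempts - 1:
                raise  # out of attempts: fail the submission
            time.sleep(backoff_s * (2 ** i))  # simple exponential backoff

# Demo with a callable that fails twice before succeeding.
calls = []
def flaky_download():
    calls.append(1)
    if len(calls) < 3:
        raise OSError("transient network error")
    return "dependency.jar"

print(fetch_with_retries(flaky_download, backoff_s=0))  # dependency.jar
```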
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 2:16 PM, Yinan Li wrote:
> but we can not rule out the benefits init-containers bring either.

Sorry, but what are those again? So far all the benefits are already provided by spark-submit...

> Again, I would suggest we look at this more thoroughly post 2.3.

Actually, one of the reasons why I brought this up is that we should remove init containers from 2.3 unless they're really required for something.

Simplifying the code is not the only issue. The init container support introduces a whole lot of user-visible behavior - like config options and the execution of a completely separate container that the user can customize. If removed later, that could be considered a breaking change.

So if we ship 2.3 without init containers and add them later if needed, it's a much better world than flipping that around.

--
Marcelo
Re: Kubernetes: why use init containers?
> 1500 less lines of code trump all of the arguments given so far for
> why the init container might be a good idea.

We can also reduce the #lines of code by simply refactoring the code in such a way that a lot of code can be shared between configuration of the main container and that of the init-container. Actually we have been discussing this as one of the things to do right after the 2.3 release and we do have a Jira ticket to track it.

It's probably true that none of the arguments we made are convincing enough, but we can not rule out the benefits init-containers bring either. Again, I would suggest we look at this more thoroughly post 2.3.

On Wed, Jan 10, 2018 at 2:06 PM, Marcelo Vanzin wrote:
> On Wed, Jan 10, 2018 at 2:00 PM, Yinan Li wrote:
>> I want to re-iterate on one point, that the init-container achieves a clear
>> separation between preparing an application and actually running the
>> application. It's a guarantee provided by the K8s admission control and
>> scheduling components that if the init-container fails, the main container
>> won't be run. I think this is definitely positive to have. In the case of a
>> Spark application, the application code and driver/executor code won't even
>> be run if the init-container fails to localize any of the dependencies
>
> That is also the case with spark-submit... (can't download
> dependencies -> spark-submit fails before running user code).
>
>> Note that we are not blindly opposing getting rid of the init-container,
>> it's just that there's still valid reasons to keep it for now
>
> I'll flip that around: I'm not against having an init container if
> it's serving a needed purpose, it's just that nobody is able to tell
> me what that needed purpose is.
>
> 1500 less lines of code trump all of the arguments given so far for
> why the init container might be a good idea.
>
> --
> Marcelo
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 2:00 PM, Yinan Li wrote:
> I want to re-iterate on one point, that the init-container achieves a clear
> separation between preparing an application and actually running the
> application. It's a guarantee provided by the K8s admission control and
> scheduling components that if the init-container fails, the main container
> won't be run. I think this is definitely positive to have. In the case of a
> Spark application, the application code and driver/executor code won't even
> be run if the init-container fails to localize any of the dependencies

That is also the case with spark-submit... (can't download dependencies -> spark-submit fails before running user code).

> Note that we are not blindly opposing getting rid of the init-container,
> it's just that there's still valid reasons to keep it for now

I'll flip that around: I'm not against having an init container if it's serving a needed purpose, it's just that nobody is able to tell me what that needed purpose is.

1500 less lines of code trump all of the arguments given so far for why the init container might be a good idea.

--
Marcelo
Re: Kubernetes: why use init containers?
I want to re-iterate on one point, that the init-container achieves a clear separation between preparing an application and actually running the application. It's a guarantee provided by the K8s admission control and scheduling components that if the init-container fails, the main container won't be run. I think this is definitely positive to have. In the case of a Spark application, the application code and driver/executor code won't even be run if the init-container fails to localize any of the dependencies. The result is that it's much easier for users to figure out what's wrong if their applications fail to run: they can tell if the pods are initialized or not and if not, simply check the status/logs of the init-container.

Another argument I want to make is we can easily make the init-container able to exclusively use certain credentials for downloading dependencies that are not appropriate to be visible in the main containers and therefore should not be shared. This is not achievable using the Spark canonical way. K8s has built-in support for dynamically injecting containers into pods through the admission control process. One use case would be for cluster operators to inject an init-container (e.g., through an admission webhook) for downloading certain dependencies that require certain access-restrictive credentials.

Note that we are not blindly opposing getting rid of the init-container, it's just that there's still valid reasons to keep it for now, particularly given that we don't have a solid story around client mode yet. Also given that we have been using it in our fork for over a year, we are definitely more confident in the current way of handling remote dependencies as it's been tested more thoroughly. Since getting rid of the init-container is such a significant change, I would suggest that we defer making a decision on if we should get rid of it to 2.4 so we have a more thorough understanding of the pros and cons.

On Wed, Jan 10, 2018 at 1:48 PM, Marcelo Vanzin wrote:
> On Wed, Jan 10, 2018 at 1:47 PM, Matt Cheah wrote:
>>> With a config value set by the submission code, like what I'm doing to
>>> prevent client mode submission in my p.o.c.?
>>
>> The contract for what determines the appropriate scheduler backend to
>> instantiate is then going to be different in Kubernetes versus the other
>> cluster managers.
>
> There is no contract for how to pick the appropriate scheduler. That's
> a decision that is completely internal to the cluster manager code.
>
> --
> Marcelo
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 1:47 PM, Matt Cheah wrote:
>> With a config value set by the submission code, like what I'm doing to
>> prevent client mode submission in my p.o.c.?
>
> The contract for what determines the appropriate scheduler backend to
> instantiate is then going to be different in Kubernetes versus the other
> cluster managers.

There is no contract for how to pick the appropriate scheduler. That's a decision that is completely internal to the cluster manager code.

--
Marcelo
Re: Kubernetes: why use init containers?
> With a config value set by the submission code, like what I'm doing to
> prevent client mode submission in my p.o.c.?

The contract for what determines the appropriate scheduler backend to instantiate is then going to be different in Kubernetes versus the other cluster managers. The cluster manager typically only picks the scheduler backend implementation based on the master URL format plus the deploy mode. Perhaps this is an acceptable tradeoff for being able to leverage spark-submit in the cluster mode deployed driver container. Again though, any flag we expose in spark-submit is a user-facing option that can be set erroneously, which is a practice we shouldn’t be encouraging.

Taking a step back though, I think we want to use spark-submit’s internals without using spark-submit itself. Any flags we add to spark-submit are user-facing. We ideally would be able to extract the dependency download + run user main class subroutines from spark-submit, and invoke that in all of the cluster managers. Perhaps this calls for a refactor in spark-submit itself to make some parts reusable in other contexts. Just an idea.

On 1/10/18, 1:38 PM, "Marcelo Vanzin" wrote:

On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah wrote:
> If we use spark-submit in client mode from the driver container, how do we
> handle needing to switch between a cluster-mode scheduler backend and a
> client-mode scheduler backend in the future?

With a config value set by the submission code, like what I'm doing to prevent client mode submission in my p.o.c.? There are plenty of solutions to that problem if that's what's worrying you.

> Something else re: client mode accessibility – if we make client mode
> accessible to users even if it’s behind a flag, that’s a very different
> contract from needing to recompile spark-submit to support client mode. The
> amount of effort required from the user to get to client mode is very
> different between the two cases

Yes. But if we say we don't support client mode, we don't support client mode regardless of how easy it is for the user to fool Spark into trying to run in that mode.

--
Marcelo
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah wrote:
> If we use spark-submit in client mode from the driver container, how do we
> handle needing to switch between a cluster-mode scheduler backend and a
> client-mode scheduler backend in the future?

With a config value set by the submission code, like what I'm doing to prevent client mode submission in my p.o.c.? There are plenty of solutions to that problem if that's what's worrying you.

> Something else re: client mode accessibility – if we make client mode
> accessible to users even if it’s behind a flag, that’s a very different
> contract from needing to recompile spark-submit to support client mode. The
> amount of effort required from the user to get to client mode is very
> different between the two cases

Yes. But if we say we don't support client mode, we don't support client mode regardless of how easy it is for the user to fool Spark into trying to run in that mode.

--
Marcelo
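[Editor's note: the "config value set by the submission code" idea can be sketched as below. The property name and function are assumptions made for illustration; the p.o.c.'s actual mechanism may differ.]

```python
# Hypothetical sketch: the cluster-mode submission path sets an internal
# property, and scheduler backend selection refuses client mode without it.
INTERNAL_FLAG = "spark.kubernetes.submitInDriver"  # name is an assumption

def select_scheduler_backend(conf):
    if conf.get(INTERNAL_FLAG) == "true":
        return "KubernetesClusterSchedulerBackend"
    # No flag present: spark-submit was invoked directly in client mode.
    raise ValueError("Client mode is currently not supported for Kubernetes.")

print(select_scheduler_backend({INTERNAL_FLAG: "true"}))  # KubernetesClusterSchedulerBackend
```

Because only the cluster-mode submission code sets the flag, a user would have to deliberately hack around it to reach client mode, which matches Marcelo's "more power to them" position.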
Re: Kubernetes: why use init containers?
If we use spark-submit in client mode from the driver container, how do we handle needing to switch between a cluster-mode scheduler backend and a client-mode scheduler backend in the future?

Something else re: client mode accessibility – if we make client mode accessible to users even if it’s behind a flag, that’s a very different contract from needing to recompile spark-submit to support client mode. The amount of effort required from the user to get to client mode is very different between the two cases, and the contract is much clearer when client mode is forbidden in all circumstances, versus client mode being allowed with a specific flag. If we’re saying that we don’t support client mode, we should bias towards making client mode as difficult as possible to access, i.e. impossible with a standard Spark distribution.

-Matt Cheah

On 1/10/18, 1:24 PM, "Marcelo Vanzin" wrote:

On Wed, Jan 10, 2018 at 1:10 PM, Matt Cheah wrote:
> I’d imagine this is a reason why YARN hasn’t gone with using spark-submit
> from the application master...

I wouldn't use YARN as a template to follow when writing a new backend. A lot of the reason why the YARN backend works the way it does is because of backwards compatibility. IMO it would be much better to change the YARN backend to use spark-submit, because it would immensely simplify the code there. It was a nightmare to get YARN to reach feature parity with other backends because it has to pretty much reimplement everything. But doing that would break pretty much every Spark-on-YARN deployment, so it's not something we can do right now.

For the other backends the situation is sort of similar; it probably wouldn't be hard to change standalone's DriverWrapper to also use spark-submit. But that brings potential side effects for existing users that don't exist with spark-on-k8s, because spark-on-k8s is new (the current fork aside).

> But using init-containers makes it such that we don’t need to use
> spark-submit at all

Those are actually separate concerns. There are a whole bunch of things that spark-submit provides you that you'd have to replicate in the k8s backend if not using it. Things like properly handling special characters in arguments, native library paths, "userClassPathFirst", etc. You get them almost for free with spark-submit, and using an init container does not solve any of those for you.

I'd say that using spark-submit is really not up for discussion here; it saves you from re-implementing a whole bunch of code that you shouldn't even be trying to re-implement.

Separately, if there is a legitimate need for an init container, then it can be added. But I don't see that legitimate need right now, so I don't see what it's bringing other than complexity. (And no, "the k8s documentation mentions that init containers are sometimes used to download dependencies" is not a legitimate need.)

--
Marcelo
Re: Kubernetes: why use init containers?
On Wed, Jan 10, 2018 at 1:10 PM, Matt Cheah wrote:
> I’d imagine this is a reason why YARN hasn’t gone with using spark-submit
> from the application master...

I wouldn't use YARN as a template to follow when writing a new backend. A lot of the reason why the YARN backend works the way it does is because of backwards compatibility. IMO it would be much better to change the YARN backend to use spark-submit, because it would immensely simplify the code there. It was a nightmare to get YARN to reach feature parity with other backends because it has to pretty much reimplement everything. But doing that would break pretty much every Spark-on-YARN deployment, so it's not something we can do right now.

For the other backends the situation is sort of similar; it probably wouldn't be hard to change standalone's DriverWrapper to also use spark-submit. But that brings potential side effects for existing users that don't exist with spark-on-k8s, because spark-on-k8s is new (the current fork aside).

> But using init-containers makes it such that we don’t need to use
> spark-submit at all

Those are actually separate concerns. There are a whole bunch of things that spark-submit provides you that you'd have to replicate in the k8s backend if not using it. Things like properly handling special characters in arguments, native library paths, "userClassPathFirst", etc. You get them almost for free with spark-submit, and using an init container does not solve any of those for you.

I'd say that using spark-submit is really not up for discussion here; it saves you from re-implementing a whole bunch of code that you shouldn't even be trying to re-implement.

Separately, if there is a legitimate need for an init container, then it can be added. But I don't see that legitimate need right now, so I don't see what it's bringing other than complexity. (And no, "the k8s documentation mentions that init containers are sometimes used to download dependencies" is not a legitimate need.)

--
Marcelo
Re: Kubernetes: why use init containers?
A crucial point here is considering whether we want to have a separate scheduler backend code path for client mode versus cluster mode. If we need such a separation in the code paths, it would be difficult to make it possible to run spark-submit in client mode from the driver container. We discussed this already when we started to think about client mode. See https://github.com/apache-spark-on-k8s/spark/pull/456. In our initial designs for a client mode, we considered that there are some concepts that would only apply to cluster mode and not to client mode – see https://github.com/apache-spark-on-k8s/spark/pull/456#issuecomment-343007093. But we haven’t worked out all of the details yet. The situation may work out such that client mode is similar enough to cluster mode that we can consider the cluster mode as being a spark-submit in client mode from a container.

I’d imagine this is a reason why YARN hasn’t gone with using spark-submit from the application master, because there are separate code paths for a YarnClientSchedulerBackend versus a YarnClusterSchedulerBackend, and the deploy mode serves as the switch between the two implementations. Though I am curious as to why Spark standalone isn’t using spark-submit – the DriverWrapper is manually fetching the user’s jars and putting them on a classloader before invoking the user’s main class with that classloader. But there’s only one scheduler backend for both client and cluster mode for standalone’s case.

The main idea here is that we need to understand if we need different code paths for a client mode scheduler backend versus a cluster mode scheduler backend, before we can know if we can use spark-submit in client mode from the driver container. But using init-containers makes it such that we don’t need to use spark-submit at all, meaning that the differences can more or less be ignored at least in this particular context.
-Matt Cheah

On 1/10/18, 8:40 AM, "Marcelo Vanzin" wrote:

On a side note, while it's great that you guys have meetings to discuss things related to the project, it's general Apache practice to discuss these things in the mailing list - or at the very least send detailed info about what was discussed in these meetings to the mailing list. Not everybody can attend these meetings, and I'm not just talking about people being busy, but there are people who live in different time zones. Now that this code is moving into Spark I'd recommend getting people more involved with the Spark project to move things forward.

On Tue, Jan 9, 2018 at 8:23 PM, Anirudh Ramanathan wrote:
> Marcelo, I can see that we might be misunderstanding what this change
> implies for performance and some of the deeper implementation details here.
> We have a community meeting tomorrow (at 10am PT), and we'll be sure to
> explore this idea in detail, and understand the implications and then get
> back to you.
>
> Thanks for the detailed responses here, and for spending time with the idea.
> (Also, you're more than welcome to attend the meeting - there's a link here
> if you're around.)
>
> Cheers,
> Anirudh
>
> On Jan 9, 2018 8:05 PM, "Marcelo Vanzin" wrote:
>
> One thing I forgot in my previous e-mail is that if a resource is
> remote I'm pretty sure (but haven't double checked the code) that
> executors will download it directly from the remote server, and not
> from the driver. So there, distributed download without an init
> container.
> On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li wrote:
>> The init-container is required for use with the resource staging server
>> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
>
> If the staging server *requires* an init container you have already a
> design problem right there.
>
>> Additionally, the init-container is a Kubernetes
>> native way of making sure that the dependencies are localized
>
> Sorry, but the init container does not do anything by itself. You had
> to add a whole bunch of code to execute the existing Spark code in an
> init container, when not doing it would have achieved the exact same
> goal much more easily, in a way that is consistent with how Spark
> already does things.
>
> Matt:
>> the executors wouldn’t receive the jars on their class loader until after
>> the executor starts
>
> I actually consider that a benefit. It means spark-on-k8s
Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster without feature penalty. Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala

Thank you.

Bests,
Dongjoon.
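[Editor's note: for readers who want to try the new reader, a hedged sketch of the relevant session configs. The key names below are the Spark 2.3 ones to the best of my knowledge; double-check them against the SQL configuration docs before relying on them.]

```python
# Configs selecting the new ORC reader in Spark 2.3 (assumed key names;
# verify against the Spark SQL configuration documentation).
orc_configs = {
    "spark.sql.orc.impl": "native",                  # ORC 1.4 based implementation instead of "hive"
    "spark.sql.orc.enableVectorizedReader": "true",  # vectorization on top of the native reader
}

# On a live session this would be spark.conf.set(key, value) for each pair;
# here we just print the equivalent spark-submit flags.
for key, value in sorted(orc_configs.items()):
    print(f"--conf {key}={value}")
```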
Re: Kubernetes: why use init containers?
On a side note: while it's great that you guys have meetings to discuss things related to the project, it's general Apache practice to discuss these things on the mailing list - or at the very least to send detailed info about what was discussed in these meetings to the mailing list. Not everybody can attend these meetings, and I'm not just talking about people being busy; there are people who live in different time zones.

Now that this code is moving into Spark, I'd recommend getting people more involved with the Spark project to move things forward.

On Tue, Jan 9, 2018 at 8:23 PM, Anirudh Ramanathan wrote:
> Marcelo, I can see that we might be misunderstanding what this change
> implies for performance and some of the deeper implementation details here.
> We have a community meeting tomorrow (at 10am PT), and we'll be sure to
> explore this idea in detail, understand the implications, and then get
> back to you.
>
> Thanks for the detailed responses here, and for spending time with the idea.
> (Also, you're more than welcome to attend the meeting - there's a link here
> if you're around.)
>
> Cheers,
> Anirudh
>
> On Jan 9, 2018 8:05 PM, "Marcelo Vanzin" wrote:
>
> One thing I forgot in my previous e-mail is that if a resource is
> remote, I'm pretty sure (but haven't double-checked the code) that
> executors will download it directly from the remote server, and not
> from the driver. So there: distributed download without an init
> container.
>
> On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li wrote:
>> The init-container is required for use with the resource staging server
>> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
>
> If the staging server *requires* an init container, you already have a
> design problem right there.
>
>> Additionally, the init-container is a Kubernetes
>> native way of making sure that the dependencies are localized
>
> Sorry, but the init container does not do anything by itself. You had
> to add a whole bunch of code to execute the existing Spark code in an
> init container, when not doing it would have achieved the exact same
> goal much more easily, in a way that is consistent with how Spark
> already does things.
>
> Matt:
>> the executors wouldn't receive the jars on their class loader until after
>> the executor starts
>
> I actually consider that a benefit. It means a spark-on-k8s application
> will behave more like all the other backends, where that is true also
> (application jars live in a separate class loader).
>
>> traditionally meant to prepare the environment for the application that is
>> to be run
>
> You guys are forcing this argument, when it all depends on where you
> draw the line. Spark can be launched without downloading any of those
> dependencies, because Spark will download them for you. Forcing the
> "kubernetes way" just means you're writing a lot more code, and
> breaking the Spark app initialization into multiple container
> invocations, to achieve the same thing.
>
>> would make the SparkSubmit code inadvertently allow running client mode
>> Kubernetes applications as well
>
> Not necessarily. I have that in my patch; it doesn't allow client mode
> unless a property that only the cluster mode submission code sets is
> present. If some users want to hack their way around that, more power
> to them; users can also compile their own Spark without the checks if
> they want to try out client mode in some way.
>
> Anirudh:
>> Telling users that they must rebuild images ... every time seems less
>> than convincing to me.
>
> Sure, I'm not proposing people use the docker image approach all the
> time.
> It would be a hassle while developing an app, as it is kind of a
> hassle today, where the code doesn't upload local files to the k8s
> cluster.
>
> But it's perfectly reasonable for people to optimize a production app
> by bundling the app into a pre-built docker image to avoid
> re-downloading resources every time. Like they'd probably place the
> jar + dependencies on HDFS today with YARN, to get the benefits of the
> YARN cache.
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

--
Marcelo
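For readers unfamiliar with the pattern being debated: a Kubernetes init container is an extra container in the pod spec that must run to completion before the main container starts, which is why it can pre-fetch dependencies the driver or executor will need. The sketch below is schematic only - it is not the spec the spark-on-k8s fork actually generates, and the image names, script path, and volume names are all illustrative:

```
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver                 # illustrative name
spec:
  initContainers:
  - name: spark-init                 # runs to completion before the driver starts
    image: example/spark-init:latest # hypothetical image
    command: ["/opt/download-deps.sh"]  # hypothetical script fetching spark.jars/spark.files
    volumeMounts:
    - name: spark-deps
      mountPath: /var/spark-data
  containers:
  - name: spark-driver
    image: example/spark:latest      # hypothetical image
    volumeMounts:
    - name: spark-deps               # driver sees the pre-fetched files here
      mountPath: /var/spark-data
  volumes:
  - name: spark-deps
    emptyDir: {}                     # shared scratch space between the two containers
```

The alternative discussed in the thread is to skip this stage entirely and let spark-submit's existing dependency-download code run inside the main container.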
spark streaming direct receiver offset initialization
In the class CachedKafkaConsumer.scala
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala

what is the purpose of the following condition check in the method get(offset: Long, timeout: Long): ConsumerRecord[K, V]?

  assert(record.offset == offset,
    s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset")

I have a production spark streaming job which, after having worked for a while (consumed kafka messages and updated/recorded offsets in kafka using rdd.asInstanceOf[HasOffsetRanges].offsetRanges and dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)), on restart, during the first attempt to resume message consumption, seems to be hitting the above assertion.

What is the purpose of that assertion - i.e. what system conditions, related to e.g. the operation of and interactions between message brokers and message consumers, is it supposed to detect?

The assertion is only present in the Spark Streaming Direct Consumer lib class, and seems to be comparing the offset provided to Kafka to start reading from with the offset of the message record actually returned (i.e. the offset which is available as a field in the record itself).

For example, something like the following - i.e. consumer offset misalignment after leader failure and subsequent leader election?

http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/

The last important Kafka cluster configuration property is unclean.leader.election.enable. It should be disabled (by default it is enabled) to avoid unrecoverable exceptions from Kafka consumer. Consider the situation when the latest committed offset is N, but after leader failure, the latest offset on the new leader is M < N. M < N because the new leader was elected from the lagging follower (not in-sync replica).
When the streaming engine asks for data from offset N using the Kafka consumer, it will get an exception because offset N does not exist yet. Someone will have to fix the offsets manually.

So the minimal recommended Kafka setup for reliable message processing is:

* 4 nodes in the cluster
* unclean.leader.election.enable=false in the brokers configuration
* replication factor for the topics - 3
* min.insync.replicas=2 property in topic configuration
* ack=all property in the producer configuration
* block.on.buffer.full=true property in the producer configuration

With the above setup your configuration should be resistant to single broker failure, and Kafka consumers will survive new leader election.

You could also take a look at the replica.lag.max.messages and replica.lag.time.max.ms properties for tuning when a follower is removed from the ISR by the leader. But that is out of this blog post's scope.
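To make the assertion's purpose concrete, here is a small Python mock (not Spark code - Spark's version lives in Scala in CachedKafkaConsumer, and all names here are illustrative) of the invariant it enforces: after seeking to a committed offset and polling, the first record returned must carry exactly that offset. A gap in the log - e.g. offsets truncated by an unclean leader election, or removed by log compaction - makes the broker hand back the next available record instead, which is precisely the mismatch the assertion catches:

```python
class OffsetMismatchError(AssertionError):
    """Raised when the record returned after a seek has a different offset."""


def fetch_after_seek(log, requested_offset):
    """Mimic seek(requested_offset) + poll(): Kafka returns the first
    record at or *after* the requested offset, not necessarily equal to it.
    `log` is a sorted list of (offset, payload) tuples, possibly with gaps."""
    for offset, payload in log:
        if offset >= requested_offset:
            if offset != requested_offset:
                # This is the condition the Spark assertion detects:
                # the committed offset no longer exists in the log.
                raise OffsetMismatchError(
                    f"Got wrong record even after seeking to offset "
                    f"{requested_offset} (got {offset})")
            return payload
    raise IndexError(f"no record at or after offset {requested_offset}")


# A log where offsets 2-4 are missing, e.g. truncated after an unclean
# leader election even though the consumer had committed offset 3.
log = [(0, "a"), (1, "b"), (5, "f"), (6, "g")]

print(fetch_after_seek(log, 1))   # offsets line up -> returns "b"
try:
    fetch_after_seek(log, 3)      # committed offset 3 is gone
except OffsetMismatchError as e:
    print("assertion fired:", e)
```

Under this reading, the restart failure described above would mean the offsets committed via commitAsync no longer exist on the broker when the job resumes - consistent with the unclean-leader-election scenario the quoted blog post describes.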