Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan
 wrote:
> We can start by getting a PR going perhaps, and start augmenting the
> integration testing to ensure that there are no surprises - with/without
> credentials, accessing GCS, S3 etc as well.
> When we get enough confidence and test coverage, let's merge this in.
> Does that sound like a reasonable path forward?

I think it's beneficial to split this into two separate questions as
far as discussion goes:

- using spark-submit: the code should definitely be starting the
driver using spark-submit, and potentially the executor using
spark-class.

- separately, we can decide on whether to keep or remove init containers.

Unfortunately, code-wise, those are not separate. If you get rid of
init containers, my current p.o.c. has most of the needed changes
(only lightly tested).

But if you keep init containers, you'll need to mess with the
configuration so that spark-submit never sees spark.jars /
spark.files, so it doesn't trigger its dependency download code. (YARN
does something similar, btw.) That will surely mean different changes
in the current k8s code (which I wanted to double check anyway because
I remember seeing some oddities related to those configs in the logs).

To comment on one point made by Andrew:
> there's almost a parallel here with spark.yarn.archive, where that configures 
> the cluster (YARN) to do distribution pre-runtime

That's more of a parallel to the docker image; spark.yarn.archive
points to an archive containing the Spark jars so that YARN can make Spark
available to the driver / executors running in the cluster.

Like the docker image, you could include other stuff that is not
really part of standard Spark in that archive too, or even not have
Spark at all there, if you want things to just fail. :-)
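
For reference, a minimal sketch of what that looks like from the submission
side, using the SparkLauncher API (the HDFS paths, app jar and main class
below are made-up examples, not anything from this thread):

  import org.apache.spark.launcher.SparkLauncher

  // Submit to YARN with a pre-staged archive of Spark jars in the YARN cache.
  val handle = new SparkLauncher()
    .setMaster("yarn")
    .setDeployMode("cluster")
    .setConf("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip") // hypothetical path
    .setAppResource("hdfs:///apps/myapp/myapp.jar")                     // hypothetical app jar
    .setMainClass("com.example.MyApp")                                  // hypothetical main class
    .startApplication()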

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Andrew Ash
It seems we have two standard practices for resource distribution in place
here:

- the Spark way is that the application (Spark) distributes the resources
*during* app execution, and does this by exposing files/jars on an http
server on the driver (or pre-staged elsewhere), with executors downloading
from that location (driver or remote)
- the Kubernetes way is that the cluster manager (Kubernetes) distributes
the resources *before* app execution, and does this primarily via docker
images, and secondarily through init containers for non-image resources.
I'd imagine a motivation for this choice on k8s' part is immutability of
the application at runtime.

When the Spark and K8s standard practices are in conflict (as they seem to
be here), which convention should be followed?

Looking at the Spark-on-YARN integration, there's almost a parallel here
with spark.yarn.archive, where that configures the cluster (YARN) to do
distribution pre-runtime instead of the application mid-runtime.

Based purely on the lines-of-code removal, right now I lean towards
eliminating init containers.  It doesn't seem like credential segregation
between init container and main pod container is that valuable right now,
and the retryability could/should be in all of Spark's cluster managers,
not just k8s.

So I support Anirudh's suggestion to move towards bringing the change
demonstrated in Marcelo's POC into master.

On Wed, Jan 10, 2018 at 3:00 PM, Anirudh Ramanathan <
ramanath...@google.com.invalid> wrote:

> Thanks for this discussion everyone. It has been very useful in getting an
> overall understanding here.
> I think in general, consensus is that this change doesn't introduce
> behavioral changes, and it's definitely an advantage to reuse the
> constructs that Spark provides to us.
>
> Moving on to a different question here - of pushing this through to Spark.
> The init containers have been tested over the past two Spark releases by
> external users and integration testing - and this would be a fundamental
> change to that behavior.
> We should work on getting enough test coverage and confidence here.
>
> We can start by getting a PR going perhaps, and start augmenting the
> integration testing to ensure that there are no surprises - with/without
> credentials, accessing GCS, S3 etc as well.
> When we get enough confidence and test coverage, let's merge this in.
> Does that sound like a reasonable path forward?
>
>
>
> On Wed, Jan 10, 2018 at 2:53 PM, Marcelo Vanzin 
> wrote:
>
>> On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah  wrote:
>> > those sidecars may perform side effects that are undesirable if the
>> main Spark application failed because dependencies weren’t available
>>
>> If the contract is that the Spark driver pod does not have an init
>> container, and the driver handles its own dependencies, then by
>> definition that situation cannot exist.
>>
>> --
>> Marcelo
>>
>
>
>
> --
> Anirudh Ramanathan
>


Re: Kubernetes: why use init containers?

2018-01-10 Thread Anirudh Ramanathan
Thanks for this discussion everyone. It has been very useful in getting an
overall understanding here.
I think in general, consensus is that this change doesn't introduce
behavioral changes, and it's definitely an advantage to reuse the
constructs that Spark provides to us.

Moving on to a different question here - of pushing this through to Spark.
The init containers have been tested over the past two Spark releases by
external users and integration testing - and this would be a fundamental
change to that behavior.
We should work on getting enough test coverage and confidence here.

We can start by getting a PR going perhaps, and start augmenting the
integration testing to ensure that there are no surprises - with/without
credentials, accessing GCS, S3 etc as well.
When we get enough confidence and test coverage, let's merge this in.
Does that sound like a reasonable path forward?



On Wed, Jan 10, 2018 at 2:53 PM, Marcelo Vanzin  wrote:

> On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah  wrote:
> > those sidecars may perform side effects that are undesirable if the main
> Spark application failed because dependencies weren’t available
>
> If the contract is that the Spark driver pod does not have an init
> container, and the driver handles its own dependencies, then by
> definition that situation cannot exist.
>
> --
> Marcelo
>



-- 
Anirudh Ramanathan


Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:51 PM, Matt Cheah  wrote:
> those sidecars may perform side effects that are undesirable if the main 
> Spark application failed because dependencies weren’t available

If the contract is that the Spark driver pod does not have an init
container, and the driver handles its own dependencies, then by
definition that situation cannot exist.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
With regards to separation of concerns, there’s a fringe use case here – if 
more than one main container is on the pod, then none of them will run if the 
init-containers fail. A user can have a Pod Preset that attaches more sidecar 
containers to the driver and/or executors. In that case, those sidecars may 
perform side effects that are undesirable if the main Spark application failed 
because dependencies weren’t available. Using the init-container to localize 
the dependencies will prevent any of these sidecars from executing at all if 
the dependencies can’t be fetched.

It’s definitely a niche use case – I’m not sure how often pod presets are used 
in practice - but it’s an example to illustrate why the separation of concerns 
can be beneficial.

-Matt Cheah

On 1/10/18, 2:36 PM, "Marcelo Vanzin"  wrote:

On Wed, Jan 10, 2018 at 2:30 PM, Yinan Li  wrote:
> 1. Retries of init-containers are automatically supported by k8s through 
pod
> restart policies. For this point, sorry I'm not sure how spark-submit
> achieves this.

Great, add that feature to spark-submit, everybody benefits, not just k8s.

> 2. The ability to use credentials that are not shared with the main
> containers.

Not sure what that achieves.

> 3. Not only the user code, but Spark internal code like Executor won't be
> run if the init-container fails.

Not sure what that achieves. Executor will fail if dependency download
fails, Spark driver will recover (and start a new executor if needed).

> 4. Easier to build tooling around k8s events/status of the init-container 
in
> case of failures as it's doing exactly one thing: downloading 
dependencies.

Again, I don't see what all this hoopla about fine-grained control of
dependency downloads is about. Spark solved this years ago for Spark
applications. Don't reinvent the wheel.

-- 
Marcelo





Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:30 PM, Yinan Li  wrote:
> 1. Retries of init-containers are automatically supported by k8s through pod
> restart policies. For this point, sorry I'm not sure how spark-submit
> achieves this.

Great, add that feature to spark-submit, everybody benefits, not just k8s.

> 2. The ability to use credentials that are not shared with the main
> containers.

Not sure what that achieves.

> 3. Not only the user code, but Spark internal code like Executor won't be
> run if the init-container fails.

Not sure what that achieves. Executor will fail if dependency download
fails, Spark driver will recover (and start a new executor if needed).

> 4. Easier to build tooling around k8s events/status of the init-container in
> case of failures as it's doing exactly one thing: downloading dependencies.

Again, I don't see what all this hoopla about fine-grained control of
dependency downloads is about. Spark solved this years ago for Spark
applications. Don't reinvent the wheel.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Yinan Li
> Sorry, but what are those again? So far all the benefits are already
> provided by spark-submit...

1. Retries of init-containers are automatically supported by k8s through
pod restart policies. For this point, sorry I'm not sure how spark-submit
achieves this.
2. The ability to use credentials that are not shared with the main
containers.
3. Not only the user code, but Spark internal code like Executor won't be
run if the init-container fails.
4. Easier to build tooling around k8s events/status of the init-container
in case of failures as it's doing exactly one thing: downloading
dependencies.

There could be others that I'm not aware of.



On Wed, Jan 10, 2018 at 2:21 PM, Marcelo Vanzin  wrote:

> On Wed, Jan 10, 2018 at 2:16 PM, Yinan Li  wrote:
> > but we can not rule out the benefits init-containers bring either.
>
> Sorry, but what are those again? So far all the benefits are already
> provided by spark-submit...
>
> > Again, I would suggest we look at this more thoroughly post 2.3.
>
> Actually, one of the reasons why I brought this up is that we should
> remove init containers from 2.3 unless they're really required for
> something.
>
> Simplifying the code is not the only issue. The init container support
> introduces a whole lot of user-visible behavior - like config options
> and the execution of a completely separate container that the user can
> customize. If removed later, that could be considered a breaking
> change.
>
> So if we ship 2.3 without init containers and add them later if
> needed, it's a much better world than flipping that around.
>
> --
> Marcelo
>


Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:16 PM, Yinan Li  wrote:
> but we can not rule out the benefits init-containers bring either.

Sorry, but what are those again? So far all the benefits are already
provided by spark-submit...

> Again, I would suggest we look at this more thoroughly post 2.3.

Actually, one of the reasons why I brought this up is that we should
remove init containers from 2.3 unless they're really required for
something.

Simplifying the code is not the only issue. The init container support
introduces a whole lot of user-visible behavior - like config options
and the execution of a completely separate container that the user can
customize. If removed later, that could be considered a breaking
change.

So if we ship 2.3 without init containers and add them later if
needed, it's a much better world than flipping that around.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Yinan Li
> 1500 fewer lines of code trump all of the arguments given so far for
> why the init container might be a good idea.

We can also reduce the #lines of code by simply refactoring the code in
such a way that a lot of code can be shared between configuration of the
main container and that of the init-container. Actually we have been
discussing this as one of the things to do right after the 2.3 release and
we do have a Jira ticket to track it. It's probably true that none of the
arguments we made are convincing enough, but we can not rule out the
benefits init-containers bring either.

Again, I would suggest we look at this more thoroughly post 2.3.

On Wed, Jan 10, 2018 at 2:06 PM, Marcelo Vanzin  wrote:

> On Wed, Jan 10, 2018 at 2:00 PM, Yinan Li  wrote:
> > I want to re-iterate on one point, that the init-container achieves a
> clear
> > separation between preparing an application and actually running the
> > application. It's a guarantee provided by the K8s admission control and
> > scheduling components that if the init-container fails, the main
> container
> > won't be run. I think this is definitely positive to have. In the case
> of a
> > Spark application, the application code and driver/executor code won't
> even
> > be run if the init-container fails to localize any of the dependencies
>
> That is also the case with spark-submit... (can't download
> dependencies -> spark-submit fails before running user code).
>
> > Note that we are not blindly opposing getting rid of the init-container,
> > it's just that there's still valid reasons to keep it for now
>
> I'll flip that around: I'm not against having an init container if
> it's serving a needed purpose, it's just that nobody is able to tell
> me what that needed purpose is.
>
> 1500 fewer lines of code trump all of the arguments given so far for
> why the init container might be a good idea.
>
> --
> Marcelo
>


Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 2:00 PM, Yinan Li  wrote:
> I want to re-iterate on one point, that the init-container achieves a clear
> separation between preparing an application and actually running the
> application. It's a guarantee provided by the K8s admission control and
> scheduling components that if the init-container fails, the main container
> won't be run. I think this is definitely positive to have. In the case of a
> Spark application, the application code and driver/executor code won't even
> be run if the init-container fails to localize any of the dependencies

That is also the case with spark-submit... (can't download
dependencies -> spark-submit fails before running user code).

> Note that we are not blindly opposing getting rid of the init-container,
> it's just that there's still valid reasons to keep it for now

I'll flip that around: I'm not against having an init container if
it's serving a needed purpose, it's just that nobody is able to tell
me what that needed purpose is.

1500 fewer lines of code trump all of the arguments given so far for
why the init container might be a good idea.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Yinan Li
I want to re-iterate on one point, that the init-container achieves a clear
separation between preparing an application and actually running the
application. It's a guarantee provided by the K8s admission control and
scheduling components that if the init-container fails, the main container
won't be run. I think this is definitely positive to have. In the case of a
Spark application, the application code and driver/executor code won't even
be run if the init-container fails to localize any of the dependencies. The
result is that it's much easier for users to figure out what's wrong if
their applications fail to run: they can tell if the pods are initialized
or not and if not, simply check the status/logs of the init-container.
Another argument I want to make is that we can easily have the init-container
exclusively use certain credentials for downloading dependencies, credentials
that are not appropriate to be visible in the main containers and therefore
should not be shared. This is not achievable using the Spark canonical way.
K8s has built-in support for dynamically injecting containers into pods
through the admission control process. One use case would be for cluster
operators to inject an init-container (e.g., through an admission webhook)
for downloading certain dependencies that require
access-restrictive credentials.

Note that we are not blindly opposing getting rid of the init-container,
it's just that there's still valid reasons to keep it for now, particularly
given that we don't have a solid story around client mode yet. Also given that
we have been using it in our fork for over a year, we are definitely more
confident in the current way of handling remote dependencies as it's been
tested more thoroughly. Since getting rid of the init-container is such a
significant change, I would suggest that we defer making a decision on whether
we should get rid of it until 2.4 so we have a more thorough understanding of
the pros and cons.

On Wed, Jan 10, 2018 at 1:48 PM, Marcelo Vanzin  wrote:

> On Wed, Jan 10, 2018 at 1:47 PM, Matt Cheah  wrote:
> >> With a config value set by the submission code, like what I'm doing to
> prevent client mode submission in my p.o.c.?
> >
> > The contract for what determines the appropriate scheduler backend to
> instantiate is then going to be different in Kubernetes versus the other
> cluster managers.
>
> There is no contract for how to pick the appropriate scheduler. That's
> a decision that is completely internal to the cluster manager code
>
> --
> Marcelo
>


Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:47 PM, Matt Cheah  wrote:
>> With a config value set by the submission code, like what I'm doing to 
>> prevent client mode submission in my p.o.c.?
>
> The contract for what determines the appropriate scheduler backend to 
> instantiate is then going to be different in Kubernetes versus the other 
> cluster managers.

There is no contract for how to pick the appropriate scheduler. That's
a decision that is completely internal to the cluster manager code

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
> With a config value set by the submission code, like what I'm doing to 
> prevent client mode submission in my p.o.c.?

The contract for what determines the appropriate scheduler backend to 
instantiate is then going to be different in Kubernetes versus the other 
cluster managers. The cluster manager typically only picks the scheduler 
backend implementation based on the master URL format plus the deploy mode. 
Perhaps this is an acceptable tradeoff for being able to leverage spark-submit 
in the cluster mode deployed driver container. Again though, any flag we expose 
in spark-submit is a user-facing option that can be set erroneously, which is a 
practice we shouldn’t be encouraging.

Taking a step back though, I think we want to use spark-submit’s internals 
without using spark-submit itself. Any flags we add to spark-submit are 
user-facing. We ideally would be able to extract the dependency download + run 
user main class subroutines from spark-submit, and invoke that in all of the 
cluster managers. Perhaps this calls for a refactor in spark-submit itself to 
make some parts reusable in other contexts. Just an idea.

On 1/10/18, 1:38 PM, "Marcelo Vanzin"  wrote:

On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah  wrote:
> If we use spark-submit in client mode from the driver container, how do 
we handle needing to switch between a cluster-mode scheduler backend and a 
client-mode scheduler backend in the future?

With a config value set by the submission code, like what I'm doing to
prevent client mode submission in my p.o.c.?

There are plenty of solutions to that problem if that's what's worrying you.

> Something else re: client mode accessibility – if we make client mode 
accessible to users even if it’s behind a flag, that’s a very different 
contract from needing to recompile spark-submit to support client mode. The 
amount of effort required from the user to get to client mode is very different 
between the two cases

Yes. But if we say we don't support client mode, we don't support
client mode regardless of how easy it is for the user to fool Spark
into trying to run in that mode.

-- 
Marcelo





Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:33 PM, Matt Cheah  wrote:
> If we use spark-submit in client mode from the driver container, how do we 
> handle needing to switch between a cluster-mode scheduler backend and a 
> client-mode scheduler backend in the future?

With a config value set by the submission code, like what I'm doing to
prevent client mode submission in my p.o.c.?

There are plenty of solutions to that problem if that's what's worrying you.

> Something else re: client mode accessibility – if we make client mode 
> accessible to users even if it’s behind a flag, that’s a very different 
> contract from needing to recompile spark-submit to support client mode. The 
> amount of effort required from the user to get to client mode is very 
> different between the two cases

Yes. But if we say we don't support client mode, we don't support
client mode regardless of how easy it is for the user to fool Spark
into trying to run in that mode.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
If we use spark-submit in client mode from the driver container, how do we 
handle needing to switch between a cluster-mode scheduler backend and a 
client-mode scheduler backend in the future?

Something else re: client mode accessibility – if we make client mode 
accessible to users even if it’s behind a flag, that’s a very different 
contract from needing to recompile spark-submit to support client mode. The 
amount of effort required from the user to get to client mode is very different 
between the two cases, and the contract is much clearer when client mode is 
forbidden in all circumstances, versus client mode being allowed with a 
specific flag. If we’re saying that we don’t support client mode, we should 
bias towards making client mode as difficult as possible to access, i.e. 
impossible with a standard Spark distribution.

-Matt Cheah

On 1/10/18, 1:24 PM, "Marcelo Vanzin"  wrote:

On Wed, Jan 10, 2018 at 1:10 PM, Matt Cheah  wrote:
> I’d imagine this is a reason why YARN hasn’t gone with using spark-submit 
from the application master...

I wouldn't use YARN as a template to follow when writing a new
backend. A lot of the reason why the YARN backend works the way it
does is because of backwards compatibility. IMO it would be much
better to change the YARN backend to use spark-submit, because it
would immensely simplify the code there. It was a nightmare to get
YARN to reach feature parity with other backends because it has to
pretty much reimplement everything.

But doing that would break pretty much every Spark-on-YARN deployment,
so it's not something we can do right now.

For the other backends the situation is sort of similar; it probably
wouldn't be hard to change standalone's DriverWrapper to also use
spark-submit. But that brings potential side effects for existing
users that don't exist with spark-on-k8s, because spark-on-k8s is new
(the current fork aside).

>  But using init-containers makes it such that we don’t need to use 
spark-submit at all

Those are actually separate concerns. There are a whole bunch of
things that spark-submit provides you that you'd have to replicate in
the k8s backend if not using it. Things like properly handling special
characters in arguments, native library paths, "userClassPathFirst",
etc. You get them almost for free with spark-submit, and using an init
container does not solve any of those for you.

I'd say that using spark-submit is really not up for discussion here;
it saves you from re-implementing a whole bunch of code that you
shouldn't even be trying to re-implement.

Separately, if there is a legitimate need for an init container, then
it can be added. But I don't see that legitimate need right now, so I
don't see what it's bringing other than complexity.

(And no, "the k8s documentation mentions that init containers are
sometimes used to download dependencies" is not a legitimate need.)

-- 
Marcelo







Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On Wed, Jan 10, 2018 at 1:10 PM, Matt Cheah  wrote:
> I’d imagine this is a reason why YARN hasn’t gone with using spark-submit 
> from the application master...

I wouldn't use YARN as a template to follow when writing a new
backend. A lot of the reason why the YARN backend works the way it
does is because of backwards compatibility. IMO it would be much
better to change the YARN backend to use spark-submit, because it
would immensely simplify the code there. It was a nightmare to get
YARN to reach feature parity with other backends because it has to
pretty much reimplement everything.

But doing that would break pretty much every Spark-on-YARN deployment,
so it's not something we can do right now.

For the other backends the situation is sort of similar; it probably
wouldn't be hard to change standalone's DriverWrapper to also use
spark-submit. But that brings potential side effects for existing
users that don't exist with spark-on-k8s, because spark-on-k8s is new
(the current fork aside).

>  But using init-containers makes it such that we don’t need to use 
> spark-submit at all

Those are actually separate concerns. There are a whole bunch of
things that spark-submit provides you that you'd have to replicate in
the k8s backend if not using it. Thinks like properly handling special
characters in arguments, native library paths, "userClassPathFirst",
etc. You get them almost for free with spark-submit, and using an init
container does not solve any of those for you.

I'd say that using spark-submit is really not up for discussion here;
it saves you from re-implementing a whole bunch of code that you
shouldn't even be trying to re-implement.

Separately, if there is a legitimate need for an init container, then
it can be added. But I don't see that legitimate need right now, so I
don't see what it's bringing other than complexity.

(And no, "the k8s documentation mentions that init containers are
sometimes used to download dependencies" is not a legitimate need.)

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-10 Thread Matt Cheah
A crucial point here is considering whether we want to have a separate 
scheduler backend code path for client mode versus cluster mode. If we need 
such a separation in the code paths, it would be difficult to make it possible 
to run spark-submit in client mode from the driver container.

We discussed this already when we started to think about client mode. See 
https://github.com/apache-spark-on-k8s/spark/pull/456. In our initial designs 
for a client mode, we considered that there are some concepts that would only 
apply to cluster mode and not to client mode – see 
https://github.com/apache-spark-on-k8s/spark/pull/456#issuecomment-343007093. 
But we haven’t worked out all of the details yet. The situation may work out 
such that client mode is similar enough to cluster mode that we can consider 
the cluster mode as being a spark-submit in client mode from a container.

I’d imagine this is a reason why YARN hasn’t gone with using spark-submit from 
the application master, because there are separate code paths for a 
YarnClientSchedulerBackend versus a YarnClusterSchedulerBackend, and the deploy 
mode serves as the switch between the two implementations. Though I am curious 
as to why Spark standalone isn’t using spark-submit – the DriverWrapper is 
manually fetching the user’s jars and putting them on a classloader before 
invoking the user’s main class with that classloader. But there’s only one 
scheduler backend for both client and cluster mode for standalone’s case.

The main idea here is that we need to understand if we need different code 
paths for a client mode scheduler backend versus a cluster mode scheduler 
backend, before we can know if we can use spark-submit in client mode from the 
driver container. But using init-containers makes it such that we don’t need to 
use spark-submit at all, meaning that the differences can more or less be 
ignored at least in this particular context.

-Matt Cheah

On 1/10/18, 8:40 AM, "Marcelo Vanzin"  wrote:

On a side note, while it's great that you guys have meetings to
discuss things related to the project, it's general Apache practice to
discuss these things in the mailing list - or at the very least send
detailed info about what was discussed in these meetings to the mailing
list. Not everybody can attend these meetings, and I'm not just
talking about people being busy, but there are people who live in
different time zones.

Now that this code is moving into Spark I'd recommend getting people
more involved with the Spark project to move things forward.

On Tue, Jan 9, 2018 at 8:23 PM, Anirudh Ramanathan
 wrote:
> Marcelo, I can see that we might be misunderstanding what this change
> implies for performance and some of the deeper implementation details 
here.
> We have a community meeting tomorrow (at 10am PT), and we'll be sure to
> explore this idea in detail, and understand the implications and then get
> back to you.
>
> Thanks for the detailed responses here, and for spending time with the 
idea.
> (Also, you're more than welcome to attend the meeting - there's a link 
here
> if you're around.)
>
> Cheers,
> Anirudh
>
>
> On Jan 9, 2018 8:05 PM, "Marcelo Vanzin"  wrote:
>
> One thing I forgot in my previous e-mail is that if a resource is
> remote I'm pretty sure (but haven't double checked the code) that
> executors will download it directly from the remote server, and not
> from the driver. So there, distributed download without an init
> container.
>
> On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li  wrote:
>> The init-container is required for use with the resource staging server
>>
>> 
(https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
>
> If the staging server *requires* an init container you have already a
> design problem right there.
>
>> Additionally, the init-container is a Kubernetes
>> native way of making sure that the dependencies are localized
>
> Sorry, but the init container does not do anything by itself. You had
> to add a whole bunch of code to execute the existing Spark code in an
> init container, when not doing it would have achieved the exact same
> goal much more easily, in a way that is consistent with how Spark
> already does things.
>
> Matt:
>> the executors wouldn’t receive the jars on their class loader until after
>> the executor starts
>
> I actually consider that a benefit. It means spark-on-k8s 

Vectorized ORC Reader in Apache Spark 2.3 with Apache ORC 1.4.1.

2018-01-10 Thread Dongjoon Hyun
Hi, All.

Vectorized ORC Reader is now supported in Apache Spark 2.3.

https://issues.apache.org/jira/browse/SPARK-16060

It has been a long journey. From now on, Spark can read ORC files faster
without any feature penalty.

Thank you for all your support, especially Wenchen Fan.

It's done by two commits.

[SPARK-16060][SQL] Support Vectorized ORC Reader
https://github.com/apache/spark/commit/f44ba910f58083458e1133502e193a9d6f2bf766

[SPARK-16060][SQL][FOLLOW-UP] add a wrapper solution for vectorized orc
reader
https://github.com/apache/spark/commit/eaac60a1e20e29084b7151ffca964cfaa5ba99d1

Please check OrcReadBenchmark for the final speed-up from `Hive built-in
ORC` to `Native ORC Vectorized`.

https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
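
For anyone who wants to try it out, here is a minimal sketch of enabling the
native vectorized reader from user code (the config names are the Spark 2.3
ones; the input path is hypothetical):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("orc-vectorized-demo").getOrCreate()
  // Use the new native ORC implementation together with the vectorized reader.
  spark.conf.set("spark.sql.orc.impl", "native")
  spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")
  val df = spark.read.orc("/path/to/data.orc")  // hypothetical input path
  df.show()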

Thank you.

Bests,
Dongjoon.


Re: Kubernetes: why use init containers?

2018-01-10 Thread Marcelo Vanzin
On a side note, while it's great that you guys have meetings to
discuss things related to the project, it's general Apache practice to
discuss these things in the mailing list - or at the very least send
detailed info about what was discussed in these meetings to the mailing
list. Not everybody can attend these meetings, and I'm not just
talking about people being busy, but there are people who live in
different time zones.

Now that this code is moving into Spark I'd recommend getting people
more involved with the Spark project to move things forward.

On Tue, Jan 9, 2018 at 8:23 PM, Anirudh Ramanathan
 wrote:
> Marcelo, I can see that we might be misunderstanding what this change
> implies for performance and some of the deeper implementation details here.
> We have a community meeting tomorrow (at 10am PT), and we'll be sure to
> explore this idea in detail, and understand the implications and then get
> back to you.
>
> Thanks for the detailed responses here, and for spending time with the idea.
> (Also, you're more than welcome to attend the meeting - there's a link here
> if you're around.)
>
> Cheers,
> Anirudh
>
>
> On Jan 9, 2018 8:05 PM, "Marcelo Vanzin"  wrote:
>
> One thing I forgot in my previous e-mail is that if a resource is
> remote I'm pretty sure (but haven't double checked the code) that
> executors will download it directly from the remote server, and not
> from the driver. So there, distributed download without an init
> container.
>
> On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li  wrote:
>> The init-container is required for use with the resource staging server
>>
>> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
>
> If the staging server *requires* an init container you have already a
> design problem right there.
>
>> Additionally, the init-container is a Kubernetes
>> native way of making sure that the dependencies are localized
>
> Sorry, but the init container does not do anything by itself. You had
> to add a whole bunch of code to execute the existing Spark code in an
> init container, when not doing it would have achieved the exact same
> goal much more easily, in a way that is consistent with how Spark
> already does things.
>
> Matt:
>> the executors wouldn’t receive the jars on their class loader until after
>> the executor starts
>
> I actually consider that a benefit. It means spark-on-k8s application
> will behave more like all the other backends, where that is true also
> (application jars live in a separate class loader).
>
>> traditionally meant to prepare the environment for the application that is
>> to be run
>
> You guys are forcing this argument when it all depends on where you
> draw the line. Spark can be launched without downloading any of those
> dependencies, because Spark will download them for you. Forcing the
> "kubernetes way" just means you're writing a lot more code, and
> breaking the Spark app initialization into multiple container
> invocations, to achieve the same thing.
>
>> would make the SparkSubmit code inadvertently allow running client mode
>> Kubernetes applications as well
>
> Not necessarily. I have that in my patch; it doesn't allow client mode
> unless a property that only the cluster mode submission code sets is
> present. If some user wants to hack their way around that, more power
> to them; users can also compile their own Spark without the checks if
> they want to try out client mode in some way.
>
> Anirudh:
>> Telling users that they must rebuild images  ... every time seems less
>> than convincing to me.
>
> Sure, I'm not proposing people use the docker image approach all the
> time. It would be a hassle while developing an app, as it is kind of a
> hassle today where the code doesn't upload local files to the k8s
> cluster.
>
> But it's perfectly reasonable for people to optimize a production app
> by bundling the app into a pre-built docker image to avoid
> re-downloading resources every time. Like they'd probably place the
> jar + dependencies on HDFS today with YARN, to get the benefits of the
> YARN cache.
>
> --
> Marcelo
>
>
>



-- 
Marcelo




spark streaming direct receiver offset initialization

2018-01-10 Thread Evo Eftimov
In the class CachedKafkaConsumer.scala

https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala

what is the purpose of the following condition check in the method
get(offset: Long, timeout: Long): ConsumerRecord[K, V]:

  assert(record.offset == offset,
    s"Got wrong record for $groupId $topic $partition even after seeking to offset $offset")

I have a production spark streaming job which, after having worked for a while
(consumed kafka messages and updated/recorded offsets in kafka using
rdd.asInstanceOf[HasOffsetRanges].offsetRanges
and dstream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)), on
restart, during the first attempt to resume message consumption, seems to be
hitting the above assertion.
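
For context, the commit pattern referenced above is roughly the canonical one
from the Spark Streaming + Kafka 0.10 integration guide; a sketch (the stream
variable is whatever KafkaUtils.createDirectStream returned):

  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

  stream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    // ... process the batch here ...
    // Commit only after the batch has been processed successfully.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  }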

 

What is the purpose of that Assertion - i.e. what System Conditions related
to e.g. the operation and interactions between Message Brokers and Message
Consumers is it supposed to detect? The assertion is only available in the
Spark Streaming Direct Consumer lib class and seems to be comparing the
value of the Offset provided to Kafka to start reading from with the Offset
of the message record returned by it (i.e. the Offset which is available as a
field in the Record itself).
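
To make the failure mode concrete, here is a hedged, simplified sketch (not
the actual Spark code) of a seek-then-read with the plain Kafka consumer; the
broker list, group id and topic are made up:

  import java.util.{Collections, Properties}
  import org.apache.kafka.clients.consumer.KafkaConsumer
  import org.apache.kafka.common.TopicPartition

  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092")  // hypothetical broker
  props.put("group.id", "offset-check")           // hypothetical group
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  val consumer = new KafkaConsumer[String, String](props)

  val tp = new TopicPartition("my-topic", 0)      // hypothetical topic/partition
  consumer.assign(Collections.singletonList(tp))

  val requestedOffset = 1000L
  consumer.seek(tp, requestedOffset)   // position the consumer at offset N
  val records = consumer.poll(1000L)   // fetch; may return data starting at a later offset
  val it = records.iterator()
  if (it.hasNext) {
    val first = it.next()
    // If offset N no longer exists on the (possibly new) leader - e.g. after log
    // truncation or an unclean leader election - the broker hands back the next
    // available offset M != N, which is exactly the mismatch the assertion detects.
    assert(first.offset == requestedOffset,
      s"wanted $requestedOffset but got ${first.offset}")
  }
  consumer.close()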

 

For example, something like the following, i.e. Consumer Offset misalignment
after Leader Failure and subsequent Leader Election?

http://mkuthan.github.io/blog/2016/01/29/spark-kafka-integration2/ 

The last important Kafka cluster configuration property is
unclean.leader.election.enable. It should be disabled (by default it is
enabled) to avoid unrecoverable exceptions from Kafka consumer. Consider the
situation when the latest committed offset is N, but after leader failure,
the latest offset on the new leader is M < N. M < N because the new leader
was elected from the lagging follower (not in-sync replica). When the
streaming engine asks for data from offset N using the Kafka consumer, it will
get an exception because the offset N does not exist yet. Someone will have
to fix offsets manually.

So the minimal recommended Kafka setup for reliable message processing is:

*   4 nodes in the cluster
*   unclean.leader.election.enable=false in the brokers configuration
*   replication factor for the topics - 3
*   min.insync.replicas=2 property in topic configuration
*   acks=all property in the producer configuration
*   block.on.buffer.full=true property in the producer configuration

With the above setup your configuration should be resistant to single broker
failure, and Kafka consumers will survive new leader election.
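
For the producer-side items in that list, a hedged sketch of what the
configuration looks like in code (broker list and topic are hypothetical;
note that block.on.buffer.full only applies to the older 0.9-era producer,
newer clients use max.block.ms instead):

  import java.util.Properties
  import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

  val props = new Properties()
  props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092")  // hypothetical brokers
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("acks", "all")  // wait for all in-sync replicas before acknowledging

  val producer = new KafkaProducer[String, String](props)
  producer.send(new ProducerRecord("events", "key", "value"))  // hypothetical topic
  producer.close()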

You could also take a look at the replica.lag.max.messages and
replica.lag.time.max.ms properties for tuning when the follower is removed
from the ISR by the leader. But this is out of the scope of this blog post.