RE: Support STS to run in k8s deployment with spark deployment mode as cluster

2018-09-15 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi,
The following bug has been filed to track this:

https://issues.apache.org/jira/browse/SPARK-25442

Regards
Surya

From: Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Sent: Sunday, September 16, 2018 10:15 AM
To: d...@spark.apache.org; Ilan Filonenko
Cc: user@spark.apache.org; Imandi, Srinivas (Nokia - IN/Bangalore); Chakradhar, N R (Nokia - IN/Bangalore); Rao, Abhishek (Nokia - IN/Bangalore)
Subject: Support STS to run in k8s deployment with spark deployment mode as cluster

Hi All,
I would like to propose the following changes to support STS (Spark Thrift
Server) running in k8s deployments with the Spark deployment mode set to cluster.

PR: https://github.com/apache/spark/pull/22433

Can you please review and provide comments?


Regards
Surya






Support STS to run in k8s deployment with spark deployment mode as cluster

2018-09-15 Thread Garlapati, Suryanarayana (Nokia - IN/Bangalore)
Hi All,
I would like to propose the following changes to support STS (Spark Thrift
Server) running in k8s deployments with the Spark deployment mode set to cluster.

PR: https://github.com/apache/spark/pull/22433

Can you please review and provide comments?


Regards
Surya



FlatMapGroupsFunction Without Running Out of Memory For Large Groups

2018-09-15 Thread ddukek
This question is more about a type of processing that I haven't been able to
find a good solution for within Spark than it is about the
FlatMapGroupsFunction specifically. However, the FlatMapGroupsFunction serves
as a good concrete example of what I'm trying to describe.

In MapReduce, if I want to do something like replicate each record that maps
to a particular key some number of times (100, 1000, 1... etc), I can: for
each record I just set up a loop for the number of replications that I want
and use the context.write(outKey, outValue) method to serialize out the data.
The amount of memory I would ever be required to use is the size of one of
the objects I'm replicating, plus some input and output buffer overhead.
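
To make that concrete, here is roughly the pattern I mean, sketched against
the Hadoop MapReduce Java API (the Text key/value types and the replication
count are just placeholders):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Illustrative reducer: emits REPLICATION copies of every value for a key.
    // Memory stays bounded because each copy is handed to context.write()
    // immediately instead of being accumulated in a collection first.
    public class ReplicatingReducer extends Reducer<Text, Text, Text, Text> {
        private static final int REPLICATION = 1000; // made-up constant

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Text value : values) {
                for (int i = 0; i < REPLICATION; i++) {
                    context.write(key, value); // streamed straight to the output
                }
            }
        }
    }

Only one record is ever live at a time here, no matter how large REPLICATION
gets.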

Now if I want to do this in Spark, there are a couple of ways I could try,
but I'm not sure any of them would work in the limit where I replicate my
data so much that I run my executor out of memory. The interface for the
FlatMapGroupsFunction is a single method:
http://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/function/FlatMapGroupsFunction.html
The return type here is an Iterator, which seems to imply that I would have
to materialize N replications of my data in memory in order to return an
iterator over it. Is there not a way to return data to Spark such that it
buffers my output records for as long as it has available memory and then
spills them to disk itself? This idea isn't limited to the Dataset API; the
API for the PySpark RDD flatMap function has the same implications.
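
To make the concern concrete, here is a rough sketch of such a function
against the Java API (the String types and the replication count are just
placeholders). Here I've tried to avoid materializing everything up front by
returning a hand-built lazy Iterator, but it's still unclear to me whether
Spark will buffer or spill its output safely once it no longer fits in memory:

    import java.util.Iterator;
    import org.apache.spark.api.java.function.FlatMapGroupsFunction;

    // Illustrative function: for every record in a group, emit REPLICATION
    // copies. The returned Iterator produces copies on demand rather than
    // building them all in a collection first.
    public class ReplicatingGroupFunction
            implements FlatMapGroupsFunction<String, String, String> {
        private static final int REPLICATION = 1000; // made-up constant

        @Override
        public Iterator<String> call(String key, Iterator<String> values) {
            return new Iterator<String>() {
                private String current;
                private int emitted = REPLICATION; // forces pulling the first value

                @Override
                public boolean hasNext() {
                    return emitted < REPLICATION || values.hasNext();
                }

                @Override
                public String next() {
                    if (emitted >= REPLICATION) {
                        current = values.next(); // advance to the next input record
                        emitted = 0;
                    }
                    emitted++;
                    return current;
                }
            };
        }
    }

It would be applied with something like
ds.groupByKey(...).flatMapGroups(new ReplicatingGroupFunction(),
Encoders.STRING()) on a Dataset<String>; the part I can't tell from the docs
is what happens to the iterator's output once Spark starts consuming it.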

Again, I'm not trying to create the world's greatest data-cloning
application; it just serves as an example. Basically, I would like a
memory-safe way to let the framework handle the case where I create more data
within a single task context than an executor can hold.

Thanks in advance for any info people can provide. If you don't necessarily
have an answer, I'm happy to brainstorm potential solutions as well.






Re: Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
In case this didn't make it onto this thread:

There is a third option, which is to deprecate Py2 in Spark 3.0 and remove
it entirely in a later 3.x release.

On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson wrote:

> On a separate dev@spark thread, I raised the question of whether or not to
> support Python 2 in Apache Spark going forward into Spark 3.0.
>
> Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0
> is an opportunity to make breaking changes to Spark's APIs, and so it is a
> good time to consider whether to continue supporting Python-2 in PySpark.
>
> Key advantages to dropping Python 2 are:
>
>- Support for PySpark becomes significantly easier.
>- Avoid having to support Python 2 until Spark 4.0, which is likely to
>imply supporting Python 2 for some time after it goes EOL.
>
> (Note that supporting Python 2 after EOL means, among other things, that
> PySpark would be supporting a version of Python that is no longer receiving
> security patches.)
>
> The main disadvantage is that PySpark users who have legacy Python-2 code
> would have to migrate their code to Python 3 to take advantage of Spark 3.0.
>
> This decision obviously has large implications for the Apache Spark
> community, and we want to solicit community feedback.
>
>


Should python-2 be supported in Spark 3.0?

2018-09-15 Thread Erik Erlandson
On a separate dev@spark thread, I raised the question of whether or not to
support Python 2 in Apache Spark going forward into Spark 3.0.

Python-2 is going EOL at the end of 2019. The upcoming release of Spark 3.0
is an opportunity to make breaking changes to Spark's APIs, and so it is a
good time to consider whether to continue supporting Python-2 in PySpark.

Key advantages to dropping Python 2 are:

   - Support for PySpark becomes significantly easier.
   - Avoid having to support Python 2 until Spark 4.0, which is likely to
   imply supporting Python 2 for some time after it goes EOL.

(Note that supporting Python 2 after EOL means, among other things, that
PySpark would be supporting a version of Python that is no longer receiving
security patches.)

The main disadvantage is that PySpark users who have legacy Python-2 code
would have to migrate their code to Python 3 to take advantage of Spark 3.0.

This decision obviously has large implications for the Apache Spark
community, and we want to solicit community feedback.