Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Hyukjin Kwon
I think we can deprecate it in 3.x.0 and remove it in Spark 4.0.0. Many
people still use Python 2. Also, technically 2.7 support is not officially
dropped yet - https://pythonclock.org/
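
For concreteness, a minimal sketch of what such a deprecation could look like
at pyspark import time; the version check and warning text below are
illustrative assumptions, not anything agreed on in this thread:

import sys
import warnings

# Hypothetical check that a future PySpark release could run at import time.
if sys.version_info[0] == 2:
    warnings.warn(
        "Support for Python 2 is deprecated and will be removed in a "
        "future release of Spark.",
        DeprecationWarning)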


On Mon, Sep 17, 2018 at 9:31 AM, Aakash Basu wrote:

> Removing support for an API in a major release makes little sense;
> deprecating is always better. Removal can always be done two to three minor
> releases later.
>
> On Mon 17 Sep, 2018, 6:49 AM Felix Cheung, 
> wrote:
>
>> I don’t think we should remove any API even in a major release without
>> deprecating it first...
>>
>>
>> --
>> *From:* Mark Hamstra 
>> *Sent:* Sunday, September 16, 2018 12:26 PM
>> *To:* Erik Erlandson
>> *Cc:* user@spark.apache.org; dev
>> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>>
>> We could also deprecate Py2 already in the 2.4.0 release.
>>
>> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
>> wrote:
>>
>>> In case this didn't make it onto this thread:
>>>
>>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>>> remove it entirely on a later 3.x release.
>>>
>>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>>> wrote:
>>>
 On a separate dev@spark thread, I raised a question of whether or not
 to support python 2 in Apache Spark, going forward into Spark 3.0.

 Python-2 is going EOL at
 the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
 make breaking changes to Spark's APIs, and so it is a good time to consider
 support for Python-2 on PySpark.

 Key advantages to dropping Python 2 are:

- Support for PySpark becomes significantly easier.
- Avoid having to support Python 2 until Spark 4.0, which is likely
to imply supporting Python 2 for some time after it goes EOL.

 (Note that supporting python 2 after EOL means, among other things,
 that PySpark would be supporting a version of python that was no longer
 receiving security patches)

 The main disadvantage is that PySpark users who have legacy python-2
 code would have to migrate their code to python 3 to take advantage of
 Spark 3.0

 This decision obviously has large implications for the Apache Spark
 community and we want to solicit community feedback.


>>>


Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Aakash Basu
Removing support for an API in a major release makes little sense;
deprecating is always better. Removal can always be done two to three minor
releases later.

On Mon 17 Sep, 2018, 6:49 AM Felix Cheung, 
wrote:

> I don’t think we should remove any API even in a major release without
> deprecating it first...
>
>
> --
> *From:* Mark Hamstra 
> *Sent:* Sunday, September 16, 2018 12:26 PM
> *To:* Erik Erlandson
> *Cc:* user@spark.apache.org; dev
> *Subject:* Re: Should python-2 be supported in Spark 3.0?
>
> We could also deprecate Py2 already in the 2.4.0 release.
>
> On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
> wrote:
>
>> In case this didn't make it onto this thread:
>>
>> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and
>> remove it entirely on a later 3.x release.
>>
>> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
>> wrote:
>>
>>> On a separate dev@spark thread, I raised a question of whether or not
>>> to support python 2 in Apache Spark, going forward into Spark 3.0.
>>>
>>> Python-2 is going EOL at
>>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>>> make breaking changes to Spark's APIs, and so it is a good time to consider
>>> support for Python-2 on PySpark.
>>>
>>> Key advantages to dropping Python 2 are:
>>>
>>>- Support for PySpark becomes significantly easier.
>>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>>to imply supporting Python 2 for some time after it goes EOL.
>>>
>>> (Note that supporting python 2 after EOL means, among other things, that
>>> PySpark would be supporting a version of python that was no longer
>>> receiving security patches)
>>>
>>> The main disadvantage is that PySpark users who have legacy python-2
>>> code would have to migrate their code to python 3 to take advantage of
>>> Spark 3.0
>>>
>>> This decision obviously has large implications for the Apache Spark
>>> community and we want to solicit community feedback.
>>>
>>>
>>


Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Felix Cheung
I don’t think we should remove any API even in a major release without 
deprecating it first...



From: Mark Hamstra 
Sent: Sunday, September 16, 2018 12:26 PM
To: Erik Erlandson
Cc: user@spark.apache.org; dev
Subject: Re: Should python-2 be supported in Spark 3.0?

We could also deprecate Py2 already in the 2.4.0 release.

On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson <eerla...@redhat.com> wrote:
In case this didn't make it onto this thread:

There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it 
entirely on a later 3.x release.

On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson <eerla...@redhat.com> wrote:
On a separate dev@spark thread, I raised a question of whether or not to 
support python 2 in Apache Spark, going forward into Spark 3.0.

Python-2 is going EOL at the end 
of 2019. The upcoming release of Spark 3.0 is an opportunity to make breaking 
changes to Spark's APIs, and so it is a good time to consider support for 
Python-2 on PySpark.

Key advantages to dropping Python 2 are:

  *   Support for PySpark becomes significantly easier.
  *   Avoid having to support Python 2 until Spark 4.0, which is likely to 
imply supporting Python 2 for some time after it goes EOL.

(Note that supporting python 2 after EOL means, among other things, that 
PySpark would be supporting a version of python that was no longer receiving 
security patches)

The main disadvantage is that PySpark users who have legacy python-2 code would 
have to migrate their code to python 3 to take advantage of Spark 3.0

This decision obviously has large implications for the Apache Spark community 
and we want to solicit community feedback.




Re: Is there any open source framework that converts Cypher to SparkSQL?

2018-09-16 Thread Matei Zaharia
GraphFrames (https://graphframes.github.io) offers a Cypher-like syntax that 
then executes on Spark SQL.
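
For reference, a minimal sketch of that motif-finding syntax, assuming the
graphframes package is installed and an active SparkSession named spark; the
vertex and edge data is purely illustrative:

from graphframes import GraphFrame

# Toy vertex and edge DataFrames (vertices need an "id" column,
# edges need "src" and "dst" columns).
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Cypher-like motif pattern: chains a->b->c, planned and executed as Spark SQL.
motifs = g.find("(a)-[e1]->(b); (b)-[e2]->(c)")
motifs.show()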

> On Sep 14, 2018, at 2:42 AM, kant kodali  wrote:
> 
> Hi All,
> 
> Is there any open source framework that converts Cypher to SparkSQL?
> 
> Thanks!


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Mark Hamstra
We could also deprecate Py2 already in the 2.4.0 release.

On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson  wrote:

> In case this didn't make it onto this thread:
>
> There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove
> it entirely on a later 3.x release.
>
> On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
> wrote:
>
>> On a separate dev@spark thread, I raised a question of whether or not to
>> support python 2 in Apache Spark, going forward into Spark 3.0.
>>
>> Python-2 is going EOL at
>> the end of 2019. The upcoming release of Spark 3.0 is an opportunity to
>> make breaking changes to Spark's APIs, and so it is a good time to consider
>> support for Python-2 on PySpark.
>>
>> Key advantages to dropping Python 2 are:
>>
>>- Support for PySpark becomes significantly easier.
>>- Avoid having to support Python 2 until Spark 4.0, which is likely
>>to imply supporting Python 2 for some time after it goes EOL.
>>
>> (Note that supporting python 2 after EOL means, among other things, that
>> PySpark would be supporting a version of python that was no longer
>> receiving security patches)
>>
>> The main disadvantage is that PySpark users who have legacy python-2 code
>> would have to migrate their code to python 3 to take advantage of Spark 3.0
>>
>> This decision obviously has large implications for the Apache Spark
>> community and we want to solicit community feedback.
>>
>>
>


Run spark tests on Windows/docker

2018-09-16 Thread Shmuel Blitz
Hi,

I'd like to build and run spark tests on my PC.

Build works fine on my Windows machine, but the tests can't run for various
reasons.

1. Is it possible to run the tests on Windows without special magic?
2. If you need some magic, how complicated is it?
3. I thought about running the tests in a Docker Linux container with the
Spark build mounted from the host PC. Has anyone done this? Do you have a
recommended Docker image to work with?
4. Any special considerations I should think of?

Thanks,
Shmuel

-- 
Shmuel Blitz
Big Data Developer
Email: shmuel.bl...@similarweb.com
www.similarweb.com






Best practices on how to multiple spark sessions

2018-09-16 Thread unk1102
Hi, I have an application that serves as an ETL job, and I have hundreds of
such ETL jobs which run daily. As of now I have just one Spark session that is
shared by all of these jobs, and sometimes they all run at the same time,
causing the Spark session to die, mostly due to memory issues. Is this a good
design? I am thinking of creating multiple Spark sessions, possibly one Spark
session per ETL job, but there is a delay in starting a Spark session, which
seems to be multiplied by the number of ETL jobs. Please share best practices
and designs for such problems. Thanks in advance.
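
One possible middle ground, sketched below purely as an illustration (the app
name, paths, and job structure are assumptions): keep a single shared
SparkContext so the startup delay is paid only once, but give each ETL job its
own isolated session via SparkSession.newSession(), which separates SQL
configuration, temp views, and UDF registrations per job. Whether this helps
with the memory issues depends on the jobs themselves; truly isolating
resources still means running separate Spark applications.

from pyspark.sql import SparkSession

# One shared application: the session-startup delay is paid only once.
base = SparkSession.builder.appName("etl-runner").getOrCreate()  # hypothetical app name

def run_etl_job(job_fn):
    # newSession() gives the job its own SQL conf, temp views, and UDF registry,
    # while reusing the same SparkContext and executors as the base session.
    session = base.newSession()
    job_fn(session)

# Example job (illustrative paths only).
def sample_job(spark):
    df = spark.read.parquet("/data/in/sample")
    df.write.mode("overwrite").parquet("/data/out/sample")

run_etl_job(sample_job)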



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Please help me: when I write code to connect Kafka with Spark using Python and run it on Jupyter, an error is displayed

2018-09-16 Thread hager
I wrote code to connect Kafka with Spark using Python, and I run it on
Jupyter. My code:
import os

# Earlier attempt, kept commented out:
#os.environ['PYSPARK_SUBMIT_ARGS'] = (
#    '--jars /home/hadoop/Desktop/spark-program/kafka/'
#    'spark-streaming-kafka-0-8-assembly_2.10-2.0.0-preview.jar pyspark-shell')

os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0 pyspark-shell")

# Note: this second assignment overwrites the one above, so only the
# spark-streaming-kafka-0-8 package is actually requested.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell")

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

#sc = SparkContext()
ssc = StreamingContext(sc, 1)  # relies on an existing SparkContext `sc` in the notebook

broker = "iotmsgs"
directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"],
                                                  {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()

The error displayed:
Spark Streaming's Kafka libraries not found in class path. Try one of the
following.

  1. Include the Kafka library and its dependencies with in the
 spark-submit command as

 $ bin/spark-submit --packages
org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...

  2. Download the JAR of the artifact from Maven Central
http://search.maven.org/,
 Group Id = org.apache.spark, Artifact Id =
spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
 Then, include the jar in the spark-submit command as

 $ bin/spark-submit --jars  ... 
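
A frequent cause of this error is that PYSPARK_SUBMIT_ARGS only takes effect
if it is set before the SparkContext is created, and that assigning it twice
keeps only the last value. A minimal sketch of an ordering that should pick up
the 0-8 integration package, assuming a broker reachable at localhost:9092 (a
placeholder address) and the topic test1 from the code above:

import os

# Set once, before any SparkContext exists in the notebook kernel.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    "--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 pyspark-shell")

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="kafka-test")  # hypothetical app name
ssc = StreamingContext(sc, 1)

# metadata.broker.list expects host:port pairs, not a topic name.
stream = KafkaUtils.createDirectStream(
    ssc, ["test1"], {"metadata.broker.list": "localhost:9092"})
stream.pprint()

ssc.start()
ssc.awaitTermination()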



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: issue Running Spark Job on Yarn Cluster

2018-09-16 Thread sivasonai
We came across such an issue in our project and resolved it by clearing space
under the HDFS directory "/user/spark". Please check whether you have enough
space and sufficient privileges for that HDFS directory.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/