Re: [PSA] Python 2, 3.4 and 3.5 are now dropped

2020-07-13 Thread Hyukjin Kwon
cc user mailing list too.

On Tue, Jul 14, 2020 at 11:27 AM, Hyukjin Kwon wrote:

> I am sending another email to make sure dev people know. Python 2, 3.4 and
> 3.5 are now dropped at https://github.com/apache/spark/pull/28957.


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Anwar AliKhan
A link to a free book which may be useful:

Hands-On Machine Learning with Scikit-Learn, Keras, and Tensorflow
Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien
Géron

https://bit.ly/2zxueGt





On Mon, 13 Jul 2020, 15:18 Sean Owen wrote:

> There is a multilayer perceptron implementation in Spark ML, but
> that's not what you're looking for.
> To parallelize model training developed using standard libraries like
> Keras, use Horovod from Uber.
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
> >
> > Dear Spark User
> >
> > I am trying to parallelize the CNN (convolutional neural network) model
> using spark. I have developed the model using python and Keras library. The
> model works fine on a single machine but when we try on multiple machines,
> the execution time remains the same as sequential.
> > Could you please tell me that there is any built-in library for CNN to
> parallelize in spark framework. Moreover, MLLIB does not have any support
> for CNN.
> > Best regards
> > Mukhtaj
> >
> >
> >
> >
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Anwar AliKhan
This is very useful for me, leading on from week 4 of the Andrew Ng course.


On Mon, 13 Jul 2020, 15:18 Sean Owen,  wrote:

> There is a multilayer perceptron implementation in Spark ML, but
> that's not what you're looking for.
> To parallelize model training developed using standard libraries like
> Keras, use Horovod from Uber.
> https://horovod.readthedocs.io/en/stable/spark_include.html
>
> On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
> >
> > Dear Spark User
> >
> > I am trying to parallelize the CNN (convolutional neural network) model
> using spark. I have developed the model using python and Keras library. The
> model works fine on a single machine but when we try on multiple machines,
> the execution time remains the same as sequential.
> > Could you please tell me that there is any built-in library for CNN to
> parallelize in spark framework. Moreover, MLLIB does not have any support
> for CNN.
> > Best regards
> > Mukhtaj
> >
> >
> >
> >
>
>
>


Re: scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

2020-07-13 Thread Ivan Petrov
What do you mean by "without conversion"?

def flatten(rdd: RDD[NestedStructure]): Dataset[MyCaseClass] = {
  rdd.flatMap { nestedElement => flatten(nestedElement) /* List[MyCaseClass] */ }
    .toDS()
}
Can it be better?

вт, 14 июл. 2020 г. в 01:13, Sean Owen :

> Wouldn't toDS() do this without conversion?
>
> On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov  wrote:
> >
> > Hi!
> > I'm trying to understand the cost of RDD to Dataset conversion
> > It takes me 60 minutes to create RDD [MyCaseClass] with 500.000.000.000
> records
> > It takes around 15 minutes to convert them to Dataset[MyCaseClass]
> > The shema of MyCaseClass is
> > str01: String,
> > str02: String,
> > str03: String,
> > str04: String,
> > long01: Long,
> > long02: Long,
> > double01: Double,
> > map: Map[String, Double]
> >
> > What can i do in order to run it faster?
>


Re: scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

2020-07-13 Thread Sean Owen
Wouldn't toDS() do this without conversion?

On Mon, Jul 13, 2020 at 5:25 PM Ivan Petrov  wrote:
>
> Hi!
> I'm trying to understand the cost of RDD to Dataset conversion
> It takes me 60 minutes to create RDD [MyCaseClass] with 500.000.000.000 
> records
> It takes around 15 minutes to convert them to Dataset[MyCaseClass]
> The shema of MyCaseClass is
> str01: String,
> str02: String,
> str03: String,
> str04: String,
> long01: Long,
> long02: Long,
> double01: Double,
> map: Map[String, Double]
>
> What can i do in order to run it faster?




scala RDD[MyCaseClass] to Dataset[MyCaseClass] performance

2020-07-13 Thread Ivan Petrov
Hi!
I'm trying to understand the cost of RDD-to-Dataset conversion.
It takes me 60 minutes to create an RDD[MyCaseClass] with 500.000.000.000
records, and around 15 minutes to convert them to Dataset[MyCaseClass].
The schema of MyCaseClass is:
str01: String,
str02: String,
str03: String,
str04: String,
long01: Long,
long02: Long,
double01: Double,
map: Map[String, Double]

What can I do in order to run it faster?


Using Spark UI with Running Spark on Hadoop Yarn

2020-07-13 Thread ArtemisDev
Is there any way to make the Spark process visible via the Spark UI when 
running Spark 3.0 on a Hadoop YARN cluster?  The Spark documentation 
talks about replacing the Spark UI with the Spark history server, but 
doesn't give many details.  I would therefore assume it is still possible 
to use the Spark UI when running Spark on a Hadoop YARN cluster.  Is this 
correct?  Does the Spark history server have the same user functions as 
the Spark UI?


But how could this be possible (using the Spark UI) if the Spark master 
isn't active, given that all the job scheduling and resource allocation 
tasks are handled by the YARN servers?


Thanks!

-- ND
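For reference, the usual arrangement (a hedged sketch; the HDFS path below is a placeholder): while an application is running on YARN, its live UI is reached through the YARN ResourceManager's proxy (the ApplicationMaster link in the RM web UI), and finished applications are served by the history server, which requires event logging to be enabled in spark-defaults.conf:

```properties
# Placeholder path -- any HDFS directory writable by the applications works.
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-logs
spark.history.fs.logDirectory    hdfs:///spark-logs
```

The history server (started with sbin/start-history-server.sh, port 18080 by default) exposes largely the same pages as the live UI: jobs, stages, storage, environment, executors.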





org.apache.spark.deploy.yarn.ExecutorLauncher not found when running Spark 3.0 on Hadoop

2020-07-13 Thread ArtemisDev
I've been trying to set up the latest stable version of Spark 3.0 on a 
Hadoop cluster using YARN.  When running spark-submit in client mode, I 
always got an error that org.apache.spark.deploy.yarn.ExecutorLauncher was 
not found.  This happened when I preloaded the Spark jar files onto HDFS and 
pointed the spark.yarn.jars property at the HDFS address (i.e. set 
spark.yarn.jars to hdfs:///spark-3/jars or 
hdfs://namenode:8020/spark-3/jars).  I've checked the /spark-3/jars 
directory on HDFS and all the jar files are accessible.  The exception 
messages are listed below.


This problem doesn't occur when I comment out the spark.yarn.jars line 
in the spark-defaults.conf file; spark-submit then finishes without any 
problems.


Any ideas what I have done wrong?  Thanks!

-- ND

==

Exception in thread "main" org.apache.spark.SparkException: Application 
application_1594664166056_0005 failed 2 times due to AM Container for 
appattempt_1594664166056_0005_02 exited with exitCode: 1
Failing this attempt.Diagnostics: [2020-07-13 20:07:20.882]Exception 
from container-launch.

Container id: container_1594664166056_0005_02_01
Exit code: 1

[2020-07-13 20:07:20.886]Container exited with a non-zero exit code 1. 
Error file: prelaunch.err.

Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class 
org.apache.spark.deploy.yarn.ExecutorLauncher





Re: Blog : Apache Spark Window Functions

2020-07-13 Thread Anwar AliKhan
Further to the feedback you requested, I forgot to mention another point:
with the insight you will gain after three weeks spent on that course,
you will be on par with the aforementioned minority of engineers who are
helping their companies "make tons of money", a quote from Professor Andrew
Ng.

You will no longer be part of the majority of engineers who are spending
six months on analytical projects when from day one YOU can see it isn't
going to work, another quote from Professor Andrew Ng.


If you value the idea of joining the minority of engineers "making tons of
money for companies", then the same three weeks spent on that course will
yield greater value than the equivalent time spent on writing Apache Spark
examples of the type you are currently engaged in.

I have gone past week 3, so I have the insight.

It is against my personal values to use a product which is offered on a
trial-period basis, so I use the free GNU Octave, a project started 32 years
ago.  You can profit from MATLAB's investment: you can watch MATLAB videos
on how to use and apply what you have learnt to Octave, because the syntax
is exactly the same.

Then you can parallelise your Octave app on Apache Spark.  You can use
Spark on a standalone machine whilst you prototype, then, with one line of
code, change the parallelism to distributed parallelism across a cluster of
PCs.



On Fri, 10 Jul 2020, 04:50 Anwar AliKhan,  wrote:

> My opinion would be go here.
>
> https://www.coursera.org/courses?query=machine%20learning%20andrew%20ng
>
> Machine learning by Andrew Ng.
>
> After three weeks you will have more valuable skills than most engineers
> in Silicon Valley in the USA. I am past week 3.
>
> He does go 90 miles per hour.
> I  wish somebody had pointed me there as the starting point.
>
>
>
> On Thu, 25 Jun 2020, 18:58 neeraj bhadani, 
> wrote:
>
>> Hi Team,
>>  I would like to share with the community that my blog on "Apache
>> Spark Window Functions" got published. PFB link if anyone interested.
>>
>> Link:
>> https://medium.com/expedia-group-tech/deep-dive-into-apache-spark-window-functions-7b4e39ad3c86
>>
>> Please share your thoughts and feedback.
>>
>> Regards,
>> Neeraj
>>
>


Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Sean Owen
There is a multilayer perceptron implementation in Spark ML, but
that's not what you're looking for.
To parallelize model training developed using standard libraries like
Keras, use Horovod from Uber.
https://horovod.readthedocs.io/en/stable/spark_include.html
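That route can be sketched roughly as follows (hedged: the model, the learning-rate handling, and num_proc=4 are illustrative placeholders, and Horovod must be installed with its Spark and TensorFlow extras):

```python
# A rough sketch only -- function and parameter names here (train,
# launch_on_spark, num_proc=4) are illustrative, not from the thread.

def train(learning_rate):
    # Runs once per parallel worker; Horovod coordinates the workers.
    # Imports are deferred so the file parses without Horovod installed.
    import horovod.tensorflow.keras as hvd
    import tensorflow as tf

    hvd.init()
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu",
                               input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    # Scale the learning rate by the worker count and wrap the optimizer.
    opt = hvd.DistributedOptimizer(
        tf.keras.optimizers.Adam(learning_rate * hvd.size()))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=opt)
    # model.fit(...) would go here, typically with
    # hvd.callbacks.BroadcastGlobalVariablesCallback(0) in the callbacks.
    return model.get_weights()


def launch_on_spark(num_proc=4):
    # Requires an active SparkContext; starts num_proc training processes.
    import horovod.spark
    return horovod.spark.run(train, args=(0.001,), num_proc=num_proc)
```

The point is that the per-worker function stays ordinary Keras code; horovod.spark.run handles distributing it across the executors.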

On Mon, Jul 13, 2020 at 6:59 AM Mukhtaj Khan  wrote:
>
> Dear Spark User
>
> I am trying to parallelize the CNN (convolutional neural network) model using 
> spark. I have developed the model using python and Keras library. The model 
> works fine on a single machine but when we try on multiple machines, the 
> execution time remains the same as sequential.
> Could you please tell me that there is any built-in library for CNN to 
> parallelize in spark framework. Moreover, MLLIB does not have any support for 
> CNN.
> Best regards
> Mukhtaj
>
>
>
>




Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Juan Martín Guillén
Hi Mukhtaj,

Parallelization in Spark is abstracted over the DataFrame.
You can run anything locally on the driver, but to make it run in parallel on
the cluster you'll need to use the DataFrame abstraction.
You may want to check maxpumperla/elephas.
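For what it's worth, a rough sketch of the elephas route, based on the project's README (hedged: distribute_training and all arguments are placeholders; sc is an existing SparkContext and keras_model a compiled Keras model):

```python
# A rough sketch only -- the wrapper function and its arguments are
# illustrative; SparkModel and to_simple_rdd are elephas's documented API.

def distribute_training(sc, keras_model, x_train, y_train):
    # Imports are deferred so the file parses without elephas installed.
    from elephas.utils.rdd_utils import to_simple_rdd
    from elephas.spark_model import SparkModel

    # Ship the numpy training arrays to the cluster as an RDD.
    rdd = to_simple_rdd(sc, x_train, y_train)

    # Wrap the compiled Keras model; partitions train in parallel and
    # weight updates are merged asynchronously on the driver.
    spark_model = SparkModel(keras_model, frequency="epoch",
                             mode="asynchronous")
    spark_model.fit(rdd, epochs=5, batch_size=32, verbose=0,
                    validation_split=0.1)
    return spark_model
```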

maxpumperla/elephas - Distributed Deep learning with Keras & Spark.

Regards,
Juan Martín.


On Monday, July 13, 2020, 08:59:35 ART, Mukhtaj Khan wrote:

Dear Spark User

I am trying to parallelize the CNN (convolutional neural network) model
using spark. I have developed the model using python and Keras library. The
model works fine on a single machine but when we try on multiple machines,
the execution time remains the same as sequential.
Could you please tell me that there is any built-in library for CNN to
parallelize in spark framework. Moreover, MLLIB does not have any support
for CNN.
Best regards
Mukhtaj

Issue in parallelization of CNN model using spark

2020-07-13 Thread Mukhtaj Khan
Dear Spark User

I am trying to parallelize a CNN (convolutional neural network) model
using Spark. I have developed the model using Python and the Keras library.
The model works fine on a single machine, but when we try it on multiple
machines, the execution time remains the same as the sequential run.
Could you please tell me whether there is any built-in library to
parallelize a CNN in the Spark framework? Moreover, MLlib does not have any
support for CNNs.
Best regards
Mukhtaj