Re: Issue in parallelization of CNN model using spark

2020-07-13 Thread Juan Martín Guillén
 Hi Mukhtaj,

Parallelization in Spark is abstracted behind the DataFrame. You can run
anything locally on the driver, but to make it run in parallel on the
cluster you'll need to use the DataFrame abstraction.

You may want to check maxpumperla/elephas.
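Conceptually, elephas does data-parallel training: each worker trains a
replica of the model on its partition of the data and the driver combines
(e.g. averages) the results. A minimal stdlib sketch of that idea, with plain
Python lists standing in for model weights and RDD partitions
(`train_on_partition` is a made-up stand-in for local training, not elephas
API):

```python
# Sketch of data-parallel model averaging, the idea behind elephas.
# "Weights" are just floats and "partitions" stand in for RDD
# partitions; no Spark or Keras involved.

def train_on_partition(weights, partition, lr=0.1):
    """Hypothetical per-worker step: nudge each weight toward the
    partition's mean (a stand-in for one epoch of local training)."""
    mean = sum(partition) / len(partition)
    return [w + lr * (mean - w) for w in weights]

def average_weights(replicas):
    """Driver-side step: element-wise average of the worker replicas."""
    n = len(replicas)
    return [sum(ws) / n for ws in zip(*replicas)]

partitions = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
weights = [0.0, 0.0]
replicas = [train_on_partition(weights, p) for p in partitions]
weights = average_weights(replicas)
print(weights)  # averaged weights from both partitions
```

In elephas itself this loop happens per epoch over RDD partitions, with the
driver broadcasting the averaged weights back to the workers.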

maxpumperla/elephas: Distributed Deep Learning with Keras & Spark.
Regards,
Juan Martín.


On Monday, July 13, 2020, 08:59:35 ART, Mukhtaj Khan
 wrote:
 
 Dear Spark User,

I am trying to parallelize a CNN (convolutional neural network) model using
Spark. I have developed the model in Python with the Keras library. The model
works fine on a single machine, but when we try it on multiple machines the
execution time remains the same as the sequential run. Could you please tell
me whether there is any built-in library for parallelizing CNNs in the Spark
framework? Moreover, MLlib does not have any support for CNNs.

Best regards,
Mukhtaj

Re: Spark yarn cluster

2020-07-11 Thread Juan Martín Guillén
 Hi Diwakar,

A YARN cluster without Hadoop is kind of a fuzzy concept.

You may well want to have Hadoop while not using MapReduce, using Spark
instead. That is the main reason to use Spark in a Hadoop cluster anyway.
On the other hand, it is highly probable you will want to use HDFS, although
it is not strictly necessary.

So, answering your question: by using YARN you are using Hadoop, because YARN
is one of its three main components, but that doesn't mean you need to use
the other components of the Hadoop stack, namely MapReduce and HDFS.

That being said, if you just need cluster scheduling and are using neither
MapReduce nor HDFS, it is possible you will be fine with the Spark Standalone
cluster.
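For reference, which cluster manager you run on comes down to the --master
URL passed to spark-submit; the host name and app.py below are placeholders:

```shell
# Submit to a Hadoop/YARN cluster (HADOOP_CONF_DIR must point at the
# cluster's configuration directory):
spark-submit --master yarn --deploy-mode cluster app.py

# Submit to a Spark Standalone cluster (no Hadoop needed; master
# host and port are placeholders):
spark-submit --master spark://master-host:7077 app.py
```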

Regards,
Juan Martín.

On Saturday, July 11, 2020, 13:57:40 ART, Diwakar Dhanuskodi
 wrote:
 
 Hi,

Would it be possible to set up Spark within a YARN cluster which may not have
Hadoop?

Thanks.

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Juan Martín Guillén
 Would you be able to send the code you are running? It would be great if you
could include some sample data.

Is that possible?

On Saturday, July 4, 2020, 13:09:23 ART, Antonin Delpeuch (lists)
 wrote:
 
 Hi Stephen and Juan,

Thanks both for your replies - you are right, I used the wrong
terminology! The local mode is what fits our needs best (and what I have
been benchmarking so far).

That being said, the problems I mention are still applicable to this
context. There is still a serialization overhead (which can be observed
from the web UI), which is really noticeable as a user.

For instance, to display the paginated grid in the tool's UI, I need to
run a simple job (filterByRange), and Spark's own overheads account for
about half of the overall execution time.

Intuitively, when running in local mode there should not be any need for
serializing tasks to pass them between threads, so that is what I am
trying to eliminate.

Regards,
Antonin

On 04/07/2020 17:49, Juan Martín Guillén wrote:
> Hi Antonin.
> 
> It seems you are confusing Standalone with Local mode. They are 2
> different modes.
> 
> From Spark in Action book: "In local mode, there is only one executor in
> the same client JVM as the driver, but
> this executor can spawn several threads to run tasks.
> In local mode, Spark uses your client process as the single executor in
> the cluster,
> and the number of threads specified determines how many tasks can be
> executed in parallel."
> 
> I am pretty sure this is the mode best suited to your use case.
> 
> What you are referring to, I think, is running a Standalone Cluster
> locally, something that does not make much sense resource-wise and may
> be considered only for testing purposes.
> 
> Running Spark in Local mode is totally fine and supported for
> non-cluster (local) environments.
> 
> Here are the options you can use to connect your Spark application:
> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
> 
> Regards,
> Juan Martín.
> 
> 
> 
> 
> On Saturday, July 4, 2020, 12:17:01 ART, Antonin Delpeuch (lists)
>  wrote:
> 
> 
> Hi,
> 
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
> 
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
> 
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read in a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
> it seems dangerous to base a production system on standalone Spark.
> 
> So, we cannot use Spark as default runner in the tool. Do you know any
> alternative which would be designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
> 
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing so does not fit our bill for the same reasons.
> 
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
> 
> Cheers,
> Antonin
> 


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

  

Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Juan Martín Guillén
 Hi Antonin.
It seems you are confusing Standalone with Local mode. They are two different
modes.

From the Spark in Action book: "In local mode, there is only one executor in
the same client JVM as the driver, but this executor can spawn several threads
to run tasks. In local mode, Spark uses your client process as the single
executor in the cluster, and the number of threads specified determines how
many tasks can be executed in parallel."

I am pretty sure this is the mode best suited to your use case.

What you are referring to, I think, is running a Standalone Cluster locally,
something that does not make much sense resource-wise and may be considered
only for testing purposes.

Running Spark in Local mode is totally fine and supported for non-cluster
(local) environments.
Here are the options you can use to connect your Spark application:
https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
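From that page, the master URL forms relevant to local execution are as
follows (the thread count 4 is just an example, and app.py a placeholder):

```shell
# Master URLs for local mode:
#   local      run with one worker thread (no parallelism)
#   local[K]   run with K worker threads
#   local[*]   run with as many threads as logical cores
spark-submit --master "local[4]" app.py
```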
Regards,
Juan Martín.



On Saturday, July 4, 2020, 12:17:01 ART, Antonin Delpeuch (lists)
 wrote:
 
 Hi,

I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.

Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.

However, OpenRefine is a lightweight tool that runs locally, on the
users' machine, and we want to preserve this use case. Running Spark in
standalone mode works, but I have read in a couple of places that the
standalone mode is only intended for development and testing. This is
confirmed by my experience with it so far:
- the overhead added by task serialization and scheduling is significant
even in standalone mode. This makes sense for testing, since you want to
test serialization as well, but to run Spark in production locally, we
would need to bypass serialization, which is not possible as far as I know;
- some bugs that manifest themselves only in local mode are not getting
a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
it seems dangerous to base a production system on standalone Spark.

So, we cannot use Spark as default runner in the tool. Do you know any
alternative which would be designed for local use? A library which would
provide something similar to the RDD API, but for parallelization with
threads in the same JVM, not machines in a cluster?

If there is no such thing, it should not be too hard to write our
homegrown implementation, which would basically be Java streams with
partitioning. I have looked at Apache Beam's direct runner, but it is
also designed for testing so does not fit our bill for the same reasons.
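The homegrown route above could look something like the following sketch (in
Python rather than Java for brevity): a toy RDD with explicit partitions,
lazily recorded map functions, and evaluation by a thread pool in the same
process, so tasks never need to be serialized.

```python
# Toy RDD-like API: partitioned data, lazy maps, threads in one process.
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

class LocalRDD:
    """Data split into partitions; transformations are recorded lazily
    and evaluated by a thread pool, one task per partition."""

    def __init__(self, partitions, fns=()):
        self.partitions = partitions   # list of lists
        self.fns = list(fns)           # pending map functions

    def map(self, f):
        # Lazy: just record the function, don't touch the data yet.
        return LocalRDD(self.partitions, self.fns + [f])

    def _eval(self, part):
        for f in self.fns:
            part = [f(x) for x in part]
        return part

    def collect(self, workers=4):
        # Threads share memory, so no task serialization is involved.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(chain.from_iterable(
                pool.map(self._eval, self.partitions)))

rdd = LocalRDD([[1, 2], [3, 4], [5, 6]])
result = rdd.map(lambda x: x * x).map(lambda x: x + 1).collect()
print(result)  # [2, 5, 10, 17, 26, 37]
```

This is only the map/collect corner of the RDD API; shuffles (groupByKey,
sorts) are where a real implementation would get involved.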

We plan to offer a Spark-based runner in any case - but I do not think
it can be used as the default runner.

Cheers,
Antonin




