Re: Issue in parallelization of CNN model using spark
Hi Mukhtaj,

Parallelization in Spark is abstracted over the DataFrame. You can run anything locally on the driver, but to make it run in parallel on the cluster you'll need to use the DataFrame abstraction. You may want to check maxpumperla/elephas (Distributed Deep Learning with Keras & Spark).

Regards,
Juan Martín.

On Monday, July 13, 2020 08:59:35 ART, Mukhtaj Khan wrote:

Dear Spark users,

I am trying to parallelize a CNN (convolutional neural network) model using Spark. I have developed the model in Python with the Keras library. The model works fine on a single machine, but when we try it on multiple machines the execution time remains the same as the sequential run. Could you please tell me whether there is any built-in library for parallelizing a CNN in the Spark framework? Moreover, MLlib does not have any support for CNNs.

Best regards,
Mukhtaj
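To make the elephas suggestion concrete, here is a rough sketch of data-parallel Keras training following the elephas README. It is untested here and requires pyspark and elephas to be installed; `model`, `x_train` and `y_train` are assumed to be a compiled Keras model and NumPy training arrays supplied by the caller.

```python
# Sketch based on the elephas README; assumes pyspark and elephas are
# installed, and that `model`, `x_train`, `y_train` already exist.
from pyspark import SparkConf, SparkContext
from elephas.utils.rdd_utils import to_simple_rdd
from elephas.spark_model import SparkModel

conf = SparkConf().setAppName('cnn-elephas')  # master URL comes from spark-submit
sc = SparkContext(conf=conf)

# Distribute the training data as an RDD of (features, label) pairs.
rdd = to_simple_rdd(sc, x_train, y_train)

# Wrap the Keras model; each worker trains on its partitions and the
# results are merged on the driver.
spark_model = SparkModel(model, frequency='epoch', mode='asynchronous')
spark_model.fit(rdd, epochs=10, batch_size=32, verbose=0, validation_split=0.1)
```

The script is then launched with spark-submit against the cluster, which is what actually spreads the work across machines.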
Re: Spark yarn cluster
Hi Diwakar,

A YARN cluster without Hadoop is a somewhat fuzzy concept. You can certainly have Hadoop and not use MapReduce, using Spark instead; that is the main reason to use Spark on a Hadoop cluster anyway. On the other hand, it is highly probable you will want to use HDFS, although it is not strictly necessary.

So, to answer your question: by using YARN you are already using Hadoop, because YARN is one of its three main components, but that doesn't mean you need to use the other components of the Hadoop cluster, namely MapReduce and HDFS.

That being said, if you just need cluster scheduling and are using neither MapReduce nor HDFS, you may well be fine with the Spark Standalone cluster.

Regards,
Juan Martín.

On Saturday, July 11, 2020 13:57:40 ART, Diwakar Dhanuskodi wrote:

Hi,

Would it be possible to set up Spark within a YARN cluster that does not have Hadoop?

Thanks.
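In code, the difference between the two schedulers is just the master URL the application connects to. A hedged sketch (requires pyspark; the host name is a placeholder, not a real cluster):

```python
# Illustrative only: the same application targets YARN or Spark Standalone
# purely through the master URL. "master-host" below is a placeholder.
from pyspark.sql import SparkSession

# On a Hadoop/YARN cluster (HADOOP_CONF_DIR must point at the cluster config):
spark = SparkSession.builder.master("yarn").appName("demo").getOrCreate()

# On a Spark Standalone cluster, with no Hadoop components at all:
spark = (SparkSession.builder
         .master("spark://master-host:7077")
         .appName("demo")
         .getOrCreate())
```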
Re: RDD-like API for entirely local workflows?
Would you be able to send the code you are running? It would be great if you could include some sample data. Is that possible?

On Saturday, July 4, 2020 13:09:23 ART, Antonin Delpeuch (lists) wrote:

Hi Stephen and Juan,

Thanks both for your replies - you are right, I used the wrong terminology! The local mode is what fits our needs best (and what I have been benchmarking so far).

That being said, the problems I mention still apply in this context. There is still a serialization overhead (which can be observed from the web UI), and it is really noticeable as a user. For instance, to display the paginated grid in the tool's UI, I need to run a simple job (filterByRange), and Spark's own overheads account for about half of the overall execution time.

Intuitively, when running in local mode there should not be any need to serialize tasks to pass them between threads, so that is what I am trying to eliminate.

Regards,
Antonin

On 04/07/2020 17:49, Juan Martín Guillén wrote:
> Hi Antonin.
>
> It seems you are confusing Standalone with Local mode. They are two
> different modes.
>
> From the Spark in Action book: "In local mode, there is only one executor
> in the same client JVM as the driver, but this executor can spawn several
> threads to run tasks. In local mode, Spark uses your client process as the
> single executor in the cluster, and the number of threads specified
> determines how many tasks can be executed in parallel."
>
> I am pretty sure this is the mode your use case is more suited to.
>
> What you are referring to, I think, is running a Standalone cluster
> locally, something that does not make much sense resource-wise and may be
> considered only for testing purposes.
>
> Running Spark in Local mode is totally fine and supported for non-cluster
> (local) environments.
>
> Here are the options you have to connect your Spark application to:
> https://spark.apache.org/docs/latest/submitting-applications.html#master-urls
>
> Regards,
> Juan Martín.
>
> On Saturday, July 4, 2020 12:17:01 ART, Antonin Delpeuch (lists) wrote:
>
> Hi,
>
> I am working on revamping the architecture of OpenRefine, an ETL tool, to
> execute workflows on datasets which do not fit in RAM.
>
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
>
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read in a couple of places that
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally we
> would need to bypass serialization, which is not possible as far as I
> know;
> - some bugs that manifest themselves only in local mode are not getting a
> lot of attention (https://issues.apache.org/jira/browse/SPARK-5300), so
> it seems dangerous to base a production system on standalone Spark.
>
> So, we cannot use Spark as the default runner in the tool. Do you know of
> any alternative which is designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
>
> If there is no such thing, it should not be too hard to write our own
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing, so it does not fit the bill for the same
> reasons.
>
> We plan to offer a Spark-based runner in any case - but I do not think it
> can be used as the default runner.
>
> Cheers,
> Antonin
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
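The homegrown approach Antonin describes (partitioned data with thread parallelism in a single process, no task serialization) can be illustrated with a minimal example. This is a hypothetical sketch, not an existing library - the names `LocalRDD` and `num_partitions` are made up - and it is written in Python for brevity, even though OpenRefine itself would do this with Java streams:

```python
# Hypothetical sketch of an RDD-like collection whose map/filter run on
# threads in the same process, sharing memory with the caller (so no
# task serialization at all). Not an existing API.
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

class LocalRDD:
    def __init__(self, data, num_partitions=4):
        data = list(data)
        step = max(1, -(-len(data) // num_partitions))  # ceiling division
        self.partitions = [data[i:i + step] for i in range(0, len(data), step)]

    def _run(self, fn):
        # Apply fn to each partition on a thread pool; results stay in-process.
        with ThreadPoolExecutor(max_workers=len(self.partitions)) as pool:
            return list(pool.map(fn, self.partitions))

    def map(self, f):
        out = LocalRDD([], 1)
        out.partitions = self._run(lambda part: [f(x) for x in part])
        return out

    def filter(self, pred):
        out = LocalRDD([], 1)
        out.partitions = self._run(lambda part: [x for x in part if pred(x)])
        return out

    def collect(self):
        return list(chain.from_iterable(self.partitions))

rdd = LocalRDD(range(10), num_partitions=3)
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

A real implementation would add lazy evaluation (composing the per-partition functions instead of materializing each step), which is the other RDD property the thread mentions.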
Re: RDD-like API for entirely local workflows?
Hi Antonin.

It seems you are confusing Standalone with Local mode. They are two different modes.

From the Spark in Action book: "In local mode, there is only one executor in the same client JVM as the driver, but this executor can spawn several threads to run tasks. In local mode, Spark uses your client process as the single executor in the cluster, and the number of threads specified determines how many tasks can be executed in parallel."

I am pretty sure this is the mode your use case is more suited to.

What you are referring to, I think, is running a Standalone cluster locally, something that does not make much sense resource-wise and may be considered only for testing purposes.

Running Spark in Local mode is totally fine and supported for non-cluster (local) environments.

Here are the options you have to connect your Spark application to:
https://spark.apache.org/docs/latest/submitting-applications.html#master-urls

Regards,
Juan Martín.

On Saturday, July 4, 2020 12:17:01 ART, Antonin Delpeuch (lists) wrote:

Hi,

I am working on revamping the architecture of OpenRefine, an ETL tool, to execute workflows on datasets which do not fit in RAM.

Spark's RDD API is a great fit for the tool's operations, and provides everything we need: partitioning and lazy evaluation.

However, OpenRefine is a lightweight tool that runs locally, on the users' machine, and we want to preserve this use case. Running Spark in standalone mode works, but I have read in a couple of places that standalone mode is only intended for development and testing. This is confirmed by my experience with it so far:
- the overhead added by task serialization and scheduling is significant even in standalone mode. This makes sense for testing, since you want to test serialization as well, but to run Spark in production locally we would need to bypass serialization, which is not possible as far as I know;
- some bugs that manifest themselves only in local mode are not getting a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300), so it seems dangerous to base a production system on standalone Spark.

So, we cannot use Spark as the default runner in the tool. Do you know of any alternative which is designed for local use? A library which would provide something similar to the RDD API, but for parallelization with threads in the same JVM, not machines in a cluster?

If there is no such thing, it should not be too hard to write our own homegrown implementation, which would basically be Java streams with partitioning. I have looked at Apache Beam's direct runner, but it is also designed for testing, so it does not fit the bill for the same reasons.

We plan to offer a Spark-based runner in any case - but I do not think it can be used as the default runner.

Cheers,
Antonin
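For completeness, the local mode described in the Spark in Action quote is selected with a `local[N]` master URL, where N is the number of executor threads. A minimal sketch (requires pyspark; app name is arbitrary):

```python
# Local mode: a single executor inside the client JVM, here with 4
# threads, so up to 4 tasks run in parallel. Illustrative sketch.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")     # or "local[*]" to use all available cores
         .appName("local-demo")
         .getOrCreate())

df = spark.range(1_000_000)
print(df.selectExpr("sum(id)").first()[0])
```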