Re: Is Spark in Java a bad idea?
Don't be too concerned about the Scala hoop. Before making the commitment to Scala, I had coded up a modest analytic prototype in Hadoop MapReduce. Once I made the commitment, it took 10 days to (1) learn enough Scala and (2) rewrite the prototype in Spark in Scala. In doing so, the execution time for this prototype was cut to 1/8, and the lines of code for identical functionality dropped to about 1/10. A few things helped me...

- Martin Odersky's "Programming in Scala". No need to read the whole thing, but use it as a reference together with the course.
- His "Functional Programming Principles in Scala" on Coursera. It's not necessary to enroll in a concurrent course. "Enroll" in a past course, watch the videos, and do a few exercises. https://class.coursera.org/progfun-003
- The cheat sheets on the Scala website. http://docs.scala-lang.org/cheatsheets/
- Example code in Spark. Plenty of it to go around.

Once you have experienced the glories of Scala, there's no turning back. It is a computer science cornucopia!

Kevin

On 10/28/2014 01:15 PM, Ron Ayoub wrote:

I interpret this to mean you have to learn Scala in order to work with Spark in Scala (goes without saying) and also to work with Spark in Java (since you have to jump through some hoops for basic functionality). The best path here is to take this as a learning opportunity and sit down and learn Scala.

Regarding RDD being an internal API: it has two methods that clearly allow you to override them, which JdbcRDD does, and it looks close to trivial -- if only I knew Scala. Once I learn Scala, the first thing I plan on doing is writing my own OracleRDD with my own flavor of JDBC code. Why would this not be advisable?

Subject: Re: Is Spark in Java a bad idea?
From: matei.zaha...@gmail.com
Date: Tue, 28 Oct 2014 11:56:39 -0700
CC: u...@spark.incubator.apache.org
To: isasmani@gmail.com

A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has Java-friendly methods and getting an RDD from it that way (a sketch follows this thread). Basically the two parameters that weren't friendly there were the ClassTag and the getConnection and mapRow functions.

Subclassing RDD in Java is also not really supported, because that's an internal API. We don't expect users to be defining their own RDDs.

Matei

On Oct 28, 2014, at 11:47 AM, critikaled isasmani@gmail.com wrote:

Hi Ron, whatever API you have in Scala, you can use from Java: Scala is interoperable with Java and vice versa. Scala, being both object-oriented and functional, will make your job easier on the JVM, and it is more concise than Java. Take it as an opportunity and start learning Scala ;).
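For readers who want the Java-friendly wrapper Matei describes, here is a minimal sketch. It assumes the Spark 1.x JdbcRDD constructor and its resultSetToObjectArray helper; the object name JavaFriendlyJdbcRDD is hypothetical, not part of Spark.

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.SparkContext
    import org.apache.spark.api.java.JavaRDD
    import org.apache.spark.rdd.JdbcRDD

    // Hides the two Java-unfriendly parameters behind a plain factory method:
    // the ClassTag is supplied implicitly here in Scala, and the
    // getConnection/mapRow Scala functions are constructed internally.
    object JavaFriendlyJdbcRDD {  // hypothetical name
      def create(
          sc: SparkContext,
          url: String,        // JDBC URL; the driver class must be on the classpath
          sql: String,        // must contain two '?' placeholders for the partition bounds
          lowerBound: Long,
          upperBound: Long,
          numPartitions: Int): JavaRDD[Array[Object]] = {
        val rdd = new JdbcRDD(
          sc,
          () => DriverManager.getConnection(url),                 // getConnection
          sql,
          lowerBound,
          upperBound,
          numPartitions,
          (rs: ResultSet) => JdbcRDD.resultSetToObjectArray(rs))  // mapRow
        JavaRDD.fromRDD(rdd)
      }
    }

From Java, one would then call JavaFriendlyJdbcRDD.create(sc, url, sql, 0L, 1000L, 10) without ever touching a ClassTag or a Scala function.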
Re: Running Spark shell on YARN
Sandy and others: Is there a single source of YARN/Hadoop properties that should be set or reset for running Spark on YARN? We've sort of stumbled through one property after another, and (unless there's an update I've not yet seen) the CDH5 Spark-related properties are for running the Spark Master rather than YARN.

Thanks
Kevin

On 08/15/2014 12:47 PM, Sandy Ryza wrote:

We generally recommend setting yarn.scheduler.maximum-allocation-mb to the maximum node capacity.

-Sandy

On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote:

I just checked the YARN config, and it looks like I need to change this value. Should it be raised to 48G (the max memory allocated to YARN) per node?

    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>6144</value>
      <source>java.io.BufferedInputStream@2e7e1ee</source>
    </property>

On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

Andrew,

Thanks for your response. When I try the following:

    ./spark-shell --executor-memory 46g --master yarn

I get this error:

    Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

After this I set the following environment variable:

    export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

The program launches but then halts with the following error:

    14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.

I guess this is some YARN setting that is not set correctly.

Thanks
-Soumya

On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote:

Hi Soumya,

The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory,

    bin/spark-shell --executor-memory 46g --master yarn

We leave a small cushion for the
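The same memory request can also be made programmatically instead of through spark-shell flags. A minimal sketch, assuming a Spark 1.x SparkConf; the app name is illustrative, and the figures are the ones from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Requests 46g per executor, leaving Andrew's "small cushion" below the
    // 48G node capacity. YARN must allow containers this large, i.e.
    // yarn.scheduler.maximum-allocation-mb must be at least the resulting
    // request (47104 MB in the error above).
    val conf = new SparkConf()
      .setMaster("yarn-client")   // HADOOP_CONF_DIR or YARN_CONF_DIR must still be exported
      .setAppName("memory-demo")  // illustrative
      .set("spark.executor.memory", "46g")
    val sc = new SparkContext(conf)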
Re: Comparative study
When you say "large data sets", how large? Thanks On 07/07/2014 01:39 PM, Daniel Siegmann wrote: From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same. Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce). I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution. Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well. (Note: the above is with Spark 1.0.0.) On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote: Hello Experts, I am doing some comparative study on the below: Spark vs Impala Spark vs MapREduce . Is it worth migrating from existing MR implementation to Spark? Please share your thoughts and expertise. Thanks, Santosh This message is for the designated recipient only and may contain privileged, proprietary, or otherwise confidential information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the e-mail by you is prohibited. Where allowed by local law, electronic communications with Accenture and its affiliates, including e-mail and instant messaging (including content), may be scanned by our systems for the purposes of information security and assessment of internal compliance with Accenture policy. __ www.accenture.com -- Daniel Siegmann, Software Developer Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001 E: daniel.siegm...@velos.io W: www.velos.io
Re: Comparative study
It seems to me that you're not taking full advantage of the lazy evaluation, especially by persisting to disk only. While it might be true that the cumulative size of the RDDs looks like it's 300GB, only a small portion of that should be resident at any one time.

We've evaluated data sets much greater than 10GB in Spark, using the Spark master and Spark with YARN (cluster -- formerly standalone -- mode). A nice thing about using YARN is that it reports the actual memory demand, not just the memory requested for driver and workers. Processing a 60GB data set through thousands of stages in a rather complex set of analytics and transformations consumed a total cluster resource (divided among all workers and the driver) of only 9GB. We were somewhat startled at first by this result, thinking that it would be much greater, but realized that it is a consequence of Spark's lazy evaluation model. This is even with several intermediate computations being cached as input to multiple evaluation paths.

Good luck.
Kevin

On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:

I'll respond for Dan.

Our test dataset was a total of 10 GB of input data (the full production dataset for this particular dataflow would be roughly 60 GB). I'm not sure what the size of the final output data was, but I think it was on the order of 20 GB for the given 10 GB of input data. Also, I can say that when we were experimenting with persist(DISK_ONLY), the size of all RDDs on disk was around 200 GB, which gives a sense of overall transient memory usage with no persistence.

In terms of our test cluster, we had 15 nodes. Each node had 24 cores and 2 workers each. Each executor got 14 GB of memory.

-Suren

On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote:

When you say "large data sets", how large?
...
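The caching pattern Kevin alludes to -- persist one intermediate RDD that feeds multiple evaluation paths, and let everything else stay lazy -- looks roughly like this. A minimal sketch to paste into spark-shell (so sc already exists); the Event schema and HDFS path are illustrative:

    import org.apache.spark.storage.StorageLevel

    // One parsed RDD feeds two evaluation paths, so it is persisted once;
    // everything upstream and downstream of it remains lazy.
    case class Event(userId: String, valid: Boolean)
    def parse(line: String): Event = {            // illustrative schema: "user<TAB>1"
      val f = line.split('\t'); Event(f(0), f(1) == "1")
    }

    val parsed = sc.textFile("hdfs:///data/events.tsv").map(parse) // illustrative path
    parsed.persist(StorageLevel.MEMORY_AND_DISK)   // only marked; nothing computed yet

    val validCount = parsed.filter(_.valid).count()          // first action fills the cache
    val userCount  = parsed.map(_.userId).distinct().count() // second path reuses it
    parsed.unpersist()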
Re: Comparative study
Nothing particularly custom. We've tested with small (4 node) development clusters, single-node pseudo-clusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, and Spark YARN (client and cluster) modes, with total memory resources ranging from 4GB to 256GB+.

K

On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:

To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way. At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time.

We agree with what you're saying. Your results are what we were hoping for and expecting. :-) Unfortunately we still haven't gotten the flow to run end to end on this relatively small dataset. It must be something related to our cluster, standalone mode, or our flow, but as far as we can tell, we are not doing anything unusual.

Did you do any custom configuration? Any advice would be appreciated.

-Suren

On Tue, Jul 8, 2014 at 1:54 PM, Kevin Markey kevin.mar...@oracle.com wrote:
...
Re: trying to understand yarn-client mode
Yarn-client mode is much like Spark client mode, except that the executors run in YARN containers managed by the YARN resource manager on the cluster, instead of as Spark workers managed by the Spark master. The driver executes as a local client in your local JVM. It communicates with the workers on the cluster. Transformations are scheduled on the cluster by the driver's logic. Actions involve communication between the local driver and the remote cluster executors. So there is some additional network overhead, especially if the driver is not co-located with the cluster. In yarn-cluster mode, by contrast, the driver is executed as a thread in a YARN application master on the cluster.

In either case, the assembly JAR must be available to the application on the cluster. Best to copy it to HDFS and specify its location by exporting that location as SPARK_JAR (a sketch follows below).

Kevin Markey

On 06/19/2014 11:22 AM, Koert Kuipers wrote:

i am trying to understand how yarn-client mode works. i am not using spark-submit, but instead launching a spark job from within my own application.

i can see my application contacting yarn successfully, but then in yarn i get an immediate error:

    Application application_1403117970283_0014 failed 2 times due to AM Container for appattempt_1403117970283_0014_02 exited with exitCode: -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not exist .Failing this attempt.. Failing the application.

why is yarn trying to fetch my jar, and why as a local file? i would expect the jar to be sent to yarn over the wire upon job submission?
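A rough sketch of the programmatic yarn-client launch Koert describes, under the Spark 1.0-era assumptions from this thread (SPARK_JAR and YARN_CONF_DIR are read from the environment of the launching JVM; the HDFS paths below are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Launch yarn-client mode from inside an application, without spark-submit.
    // Assumes the shell that starts this JVM exported something like:
    //   export SPARK_JAR=hdfs:///libs/spark-assembly-1.0.0-hadoop2.3.0.jar
    //   export YARN_CONF_DIR=/etc/hadoop/conf
    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("my-app")  // illustrative
      // Ship the application jar itself from HDFS rather than referencing a
      // local path, so YARN containers can fetch it:
      .setJars(Seq("hdfs:///libs/test-assembly-0.1-SNAPSHOT.jar"))
    val sc = new SparkContext(conf)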
Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
Tom

On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com wrote:

I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark 0.9.1 and Hadoop 2.3 or 2.2. The application ran to conclusion but ultimately failed. There were 2 anomalies...

1. ASM reported only that the application was "ACCEPTED". It never indicated that the application was "RUNNING."

    14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
      application identifier: application_1400696988985_0007
      appId: 7
      clientToAMToken: null
      appDiagnostics:
      appMasterHost: N/A
      appQueue: default
      appMasterRpcPort: -1
      appStartTime: 1400709970857
      yarnAppState: ACCEPTED
      distributedFinalState: UNDEFINED
      appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/
      appUser: hduser

Furthermore, it started a second container, running two partly overlapping drivers, even though it appeared that the application never started. Each container ran to conclusion as explained above, taking twice as long as usual for both to complete. Both instances had the same concluding failure.

2. Each instance failed, as indicated by the stderr log, finding that the filesystem was closed when trying to clean up the staging directories.

    14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
    14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
    14/05/21 16:08:24 INFO Executor: Finished task ID 1453
    14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2)
    14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
    14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool
    14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s
    14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s
Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory
    16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/spark-assembly-1.0.0-hadoop2.3.0.jar

Just prior to the staging directory cleanup, the application concluded by writing results to 3 HDFS files. That occurred without incident.

This particular test was run using...

1. RC10 compiled as follows:

    mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

2. Ran in yarn-cluster mode using spark-submit.

Is there any configuration new to 1.0.0 that I might be missing? I walked through all the changes on the YARN deploy web page, updating my scripts and configuration appropriately, and everything ran except for these two anomalies.

Thanks
Kevin Markey
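One plausible cause of a "filesystem closed" failure during staging cleanup -- an assumption on my part, not something established in this thread -- is Hadoop's FileSystem cache: FileSystem.get() returns a JVM-wide shared instance, so application code that close()s it also closes the instance the YARN machinery later uses to delete .sparkStaging. A minimal sketch of the anti-pattern and a way around it (paths illustrative):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()

    // FileSystem.get() returns a cached, JVM-wide instance.
    val fs = FileSystem.get(conf)
    val out = fs.create(new Path("/results/part-00000")) // illustrative path
    out.close()   // fine: close the stream you opened
    // fs.close() // anti-pattern: closes the shared instance for everyone,
                  // including later staging-directory cleanup

    // If an isolated instance is genuinely needed, take an uncached one:
    val privateFs = FileSystem.newInstance(conf) // safe to close independently
    privateFs.close()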
Re: Job initialization performance of Spark standalone mode vs YARN
We are now testing precisely what you ask about in our environment. But Sandy's questions are relevant. The bigger issue is not Spark vs. YARN but "client" vs. "standalone" deployment, and where the client is located on the network relative to the cluster. The "client" options, which locate the client/master remote from the cluster, are useful for interactive queries but suffer considerable network traffic overhead as the master schedules and transfers data with the worker nodes on the cluster. The "standalone" options locate the master/client on the cluster. In yarn-standalone mode, the master is a thread contained by the YARN resource manager. Lots less traffic, as the master is co-located with the worker nodes, and its scheduling and data communication have less latency.

In my comparisons between yarn-client and yarn-standalone (so as not to conflate YARN with Spark), yarn-client computation time is at least double that of yarn-standalone! At least for a job with lots of stages and lots of client/worker communication, although with rather few "collect" actions (see the sketch below), so it's mainly scheduling that's relevant here. I'll be posting more information as I have it available.

Kevin

On 03/03/2014 03:48 PM, Sandy Ryza wrote:

Are you running in yarn-standalone mode or yarn-client mode? Also, what YARN scheduler and what NodeManager heartbeat?

On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote:

Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing at capacity in under a second :)

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Job-initialization-performance-of-Spark-standalone-mode-vs-YARN-tp2016p2243.html
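To make the traffic point concrete: an action like collect() moves the entire result set from the executors to the driver, which hurts most when the driver sits off-cluster in yarn-client mode, while aggregate actions return only small values. A minimal sketch, assuming an existing SparkContext sc; the path is illustrative:

    // With an off-cluster driver (yarn-client), prefer actions that return
    // small aggregates over ones that ship the whole dataset.
    val lines = sc.textFile("hdfs:///data/big.txt") // illustrative path

    val n = lines.count()       // only a single Long crosses the network
    val sample = lines.take(10) // a bounded handful of records

    // val everything = lines.collect() // whole dataset to the driver: avoid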
Re: Is there a way to get the current progress of the job?
The discussion there hits on the distinction between jobs and stages. When looking at one application, there are hundreds of stages, sometimes thousands, depending on the data and the task. The UI seems to track stages, and one could independently track them for such a job. But what if -- as occurs in another application -- there are only one or two stages, but lots of data passing through those 1 or 2 stages?

Kevin Markey

On 04/01/2014 09:55 AM, Mark Hamstra wrote:

Some related discussion: https://github.com/apache/spark/pull/246

On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren philip.og...@oracle.com wrote:

Hi DB,

Just wondering if you ever got an answer to your question about monitoring progress - either offline or through your own investigation. Any findings would be appreciated.

Thanks,
Philip

On 01/30/2014 10:32 PM, DB Tsai wrote:

Hi guys,

When we're running a very long job, we would like to show users the current progress of the map and reduce jobs. After looking at the API documentation, I don't find anything for this. However, in the Spark UI, I can see the progress of the tasks. Is there anything I missed?

Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/
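For anyone wanting programmatic progress rather than the web UI, here is a rough sketch using the SparkListener developer API. The event class names match Spark 1.x; earlier releases had slightly different signatures, so treat this as an outline rather than version-exact code. It assumes an existing SparkContext sc:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Counts finished tasks and stages -- roughly the raw events the web
    // UI's progress display is built from.
    class ProgressListener extends SparkListener {
      private val tasksDone = new AtomicInteger(0)
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit =
        println(s"tasks finished: ${tasksDone.incrementAndGet()}")
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit =
        println(s"stage ${stage.stageInfo.stageId} completed " +
          s"(${stage.stageInfo.numTasks} tasks)")
    }

    sc.addSparkListener(new ProgressListener)

As Kevin notes, this tracks stages and tasks, not bytes: a job with one or two long stages would still show coarse progress unless you also sample task-level metrics.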