Re: Is Spark in Java a bad idea?

2014-10-28 Thread Kevin Markey

  
  
Don't be too concerned about the Scala hoop.  Before making the commitment to Scala, I had coded up a modest analytic prototype in Hadoop MapReduce.  Once I made the commitment, it took 10 days to (1) learn enough Scala and (2) rewrite the prototype in Spark in Scala.  In doing so, the execution time for the prototype was cut to about 1/8, and the lines of code for identical functionality dropped to roughly 1/10.
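
To give a rough sense of what that difference looks like (a made-up toy example, not the prototype itself): a grouped count that needs a mapper class, a reducer class, and a driver in MapReduce is a handful of lines in Spark/Scala:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
        val counts = sc.textFile(args(0))        // read input lines
          .flatMap(_.split("\\s+"))              // split into words
          .map(word => (word, 1))                // pair each word with a count of 1
          .reduceByKey(_ + _)                    // sum the counts per word
        counts.saveAsTextFile(args(1))           // write (word, count) pairs out
        sc.stop()
      }
    }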

A few things helped me...

- Martin Odersky's "Programming in Scala".  No need to read the whole thing; use it as a reference alongside the course.
- His "Functional Programming Principles in Scala" on Coursera.  You don't need to enroll in a currently running session; "enroll" in a past one, watch the videos, and do a few exercises.  https://class.coursera.org/progfun-003
- The cheat sheets on the Scala website.  http://docs.scala-lang.org/cheatsheets/?_ga=1.267044046.1769090313.1387491444
- Example code in Spark.  Plenty of it to go around.

Once you have experienced the glories of Scala, there's no turning
back.  It is a computer science cornucopia!

Kevin


On 10/28/2014 01:15 PM, Ron Ayoub wrote:

I interpret this to mean you have to learn Scala in order to work with Spark in Scala (goes without saying) and also to work with Spark in Java (since you have to jump through some hoops for basic functionality).


The best path here is to take this as a learning opportunity and sit down and learn Scala.


Regarding RDD being an internal API, it has two methods that clearly allow you to override them, which the JdbcRDD does, and it looks close to trivial -- if only I knew Scala. Once I learn Scala, I would say the first thing I plan on doing is writing my own OracleRDD with my own flavor of JDBC code. Why would this not be advisable?
 

Subject: Re: Is Spark in Java a bad idea?
From: matei.zaha...@gmail.com
Date: Tue, 28 Oct 2014 11:56:39 -0700
CC: u...@spark.incubator.apache.org
To: isasmani@gmail.com
   
A pretty large fraction of users use Java, but a few features are still not available in it. JdbcRDD is one of them -- this functionality will likely be superseded by Spark SQL when we add JDBC as a data source. In the meantime, to use it, I'd recommend writing a class in Scala that has Java-friendly methods and getting an RDD to it from that. Basically the two parameters that weren't friendly there were the ClassTag and the getConnection and mapRow functions.
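
A minimal sketch of such a wrapper (the object and method names below are illustrative, not part of Spark's API):

    import java.sql.{Connection, DriverManager, ResultSet}
    import org.apache.spark.api.java.{JavaRDD, JavaSparkContext}
    import org.apache.spark.rdd.JdbcRDD

    // Hypothetical Scala object exposing JdbcRDD through a Java-friendly method,
    // hiding the ClassTag and the getConnection/mapRow closures from Java callers.
    object JavaFriendlyJdbcRDD {
      def create(jsc: JavaSparkContext,
                 url: String,
                 sql: String,             // must contain two '?' placeholders for the bounds
                 lowerBound: Long,
                 upperBound: Long,
                 numPartitions: Int): JavaRDD[Array[Object]] = {
        val rdd = new JdbcRDD(
          jsc.sc,                                  // underlying SparkContext
          () => DriverManager.getConnection(url),  // getConnection closure
          sql,
          lowerBound, upperBound, numPartitions,
          (rs: ResultSet) => JdbcRDD.resultSetToObjectArray(rs))
        JavaRDD.fromRDD(rdd)                       // ClassTag is supplied implicitly here
      }
    }

From Java, one would then call something like JavaFriendlyJdbcRDD.create(jsc, url, sql, 0L, 1000L, 10) and work with the resulting JavaRDD as usual.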
   
Subclassing RDD in Java is also not really supported, because that's an internal API. We don't expect users to be defining their own RDDs.
   
   Matei
   
On Oct 28, 2014, at 11:47 AM, critikaled isasmani@gmail.com wrote:

Hi Ron,
Whatever API you have in Scala, you can possibly use it from Java. Scala is interoperable with Java and vice versa. Scala, being both object oriented and functional, will make your job easier on the JVM, and it is more concise than Java. Take it as an opportunity and start learning Scala ;).






Re: Running Spark shell on YARN

2014-08-15 Thread Kevin Markey

  
  
Sandy and others:

Is there a single source of Yarn/Hadoop properties that should be set or reset for running Spark on Yarn?  We've sort of stumbled through one property after another, and (unless there's an update I've not yet seen) the CDH5 Spark-related properties are for running the Spark standalone master instead of Yarn.

Thanks
Kevin

On 08/15/2014 12:47 PM, Sandy Ryza wrote:


We generally recommend setting yarn.scheduler.maximum-allocation-mb to the maximum node capacity.

-Sandy
  
  

On Fri, Aug 15, 2014 at 11:41 AM, Soumya Simanta soumya.sima...@gmail.com wrote:

I just checked the YARN config, and it looks like I need to change this value. Should it be upgraded to 48G (the max memory allocated to YARN) per node?

    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>6144</value>
      <source>java.io.BufferedInputStream@2e7e1ee</source>
    </property>
  


  
On Fri, Aug 15, 2014 at 2:37 PM, Soumya Simanta soumya.sima...@gmail.com wrote:

Andrew,

Thanks for your response.

When I try to do the following:

    ./spark-shell --executor-memory 46g --master yarn

I get the following error:

    Exception in thread "main" java.lang.Exception: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
        at org.apache.spark.deploy.SparkSubmitArguments.checkRequiredArguments(SparkSubmitArguments.scala:166)
        at org.apache.spark.deploy.SparkSubmitArguments.<init>(SparkSubmitArguments.scala:61)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:50)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

After this I set the following env variable:

    export YARN_CONF_DIR=/usr/lib/hadoop-yarn/etc/hadoop/

The program launches but then halts with the following error:

    14/08/15 14:33:22 ERROR yarn.Client: Required executor memory (47104 MB), is above the max threshold (6144 MB) of this cluster.

I guess this is some YARN setting that is not set correctly.

Thanks
-Soumya

  
  

  

  


On Fri, Aug 15, 2014 at 2:19 PM, Andrew Or and...@databricks.com wrote:

Hi Soumya,

The driver's console output prints out how much memory is actually granted to each executor, so from there you can verify how much memory the executors are actually getting. You should use the '--executor-memory' argument in spark-shell. For instance, assuming each node has 48G of memory,

    bin/spark-shell --executor-memory 46g --master yarn

We leave a small cushion for the
   

Re: Comparative study

2014-07-08 Thread Kevin Markey

  
  
When you say "large data sets", how large?
Thanks

On 07/07/2014 01:39 PM, Daniel Siegmann wrote:


  

From a development perspective, I vastly prefer Spark to MapReduce. The MapReduce API is very constrained; Spark's API feels much more natural to me. Testing and local development is also very easy - creating a local Spark context is trivial and it reads local files. For your unit tests you can just have them create a local context and execute your flow with some test data. Even better, you can do ad-hoc work in the Spark shell and if you want that in your production code it will look exactly the same.
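
For instance, a local-mode unit test might look something like this (a minimal sketch; the test framework and names are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.scalatest.FunSuite

    // The same flow that runs on the cluster is exercised here against a
    // local SparkContext and a small in-memory dataset.
    class FlowSuite extends FunSuite {
      test("word pairs are counted") {
        val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("test"))
        try {
          val input = sc.parallelize(Seq("a b", "b b"))
          val counts = input.flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .collectAsMap()
          assert(counts("b") === 3)
          assert(counts("a") === 1)
        } finally {
          sc.stop()    // always stop the context so tests don't leak it
        }
      }
    }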

  
Unfortunately, the picture isn't so rosy when it gets to production. In my experience, Spark simply doesn't scale to the volumes that MapReduce will handle. Not with a Standalone cluster anyway - maybe Mesos or YARN would be better, but I haven't had the opportunity to try them. I find jobs tend to just hang forever for no apparent reason on large data sets (but smaller than what I push through MapReduce).

I am hopeful the situation will improve - Spark is developing quickly - but if you have large amounts of data you should proceed with caution.

Keep in mind there are some frameworks for Hadoop which can hide the ugly MapReduce with something very similar in form to Spark's API; e.g. Apache Crunch. So you might consider those as well.

(Note: the above is with Spark 1.0.0.)
  
  
  

  
  

On Mon, Jul 7, 2014 at 11:07 AM, santosh.viswanat...@accenture.com wrote:
  

  
Hello Experts,

I am doing some comparative study on the below:

Spark vs Impala
Spark vs MapReduce. Is it worth migrating from existing MR implementation to Spark?

Please share your thoughts and expertise.

Thanks,
Santosh
  
  
  
  
  

  




-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io  W: www.velos.io

  


  



Re: Comparative study

2014-07-08 Thread Kevin Markey

  
  
It seems to me that you're not taking full advantage of the lazy evaluation, especially persisting to disk only.  While it might be true that the cumulative size of the RDDs looks like it's 300GB, only a small portion of that should be resident at any one time.  We've evaluated data sets much greater than 10GB in Spark using the Spark master and Spark with Yarn (cluster -- formerly standalone -- mode).  Nice thing about using Yarn is that it reports the actual memory demand, not just the memory requested for driver and workers.  Processing a 60GB data set through thousands of stages in a rather complex set of analytics and transformations consumed a total cluster resource (divided among all workers and driver) of only 9GB.  We were somewhat startled at first by this result, thinking that it would be much greater, but realized that it is a consequence of Spark's lazy evaluation model.  This is even with several intermediate computations being cached as input to multiple evaluation paths.
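
To illustrate the point (a toy sketch, not your actual dataflow): nothing below is computed until an action runs, partitions stream through the pipeline rather than sitting in memory all at once, and only the one intermediate RDD that feeds two downstream paths is cached:

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyEvalSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("LazyEvalSketch"))

        // Only a lineage of transformations so far -- nothing has been read or computed.
        val records  = sc.textFile(args(0)).map(_.split('\t'))
        val filtered = records.filter(_.length > 3)

        // Cache just the intermediate result shared by multiple evaluation paths.
        val keyed = filtered.map(f => (f(0), f(2).toDouble)).cache()

        // Each action pulls data through the pipeline; partitions are processed and
        // discarded as they go, so the cumulative size of all RDDs in the lineage
        // never has to be resident at once.
        val aggregates   = keyed.reduceByKey(_ + _).count()
        val distinctKeys = keyed.keys.distinct().count()

        println(s"aggregated rows: $aggregates, distinct keys: $distinctKeys")
        sc.stop()
      }
    }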

Good luck.

Kevin


On 07/08/2014 11:04 AM, Surendranauth Hiraman wrote:


I'll respond for Dan.

Our test dataset was a total of 10 GB of input data (full production dataset for this particular dataflow would be 60 GB roughly).

I'm not sure what the size of the final output data was, but I think it was on the order of 20 GBs for the given 10 GB of input data. Also, I can say that when we were experimenting with persist(DISK_ONLY), the size of all RDDs on disk was around 200 GB, which gives a sense of overall transient memory usage with no persistence.

In terms of our test cluster, we had 15 nodes. Each node had 24 cores and 2 workers each. Each executor got 14 GB of memory.

-Suren


  
  


On Tue, Jul 8, 2014 at 12:06 PM, Kevin Markey kevin.mar...@oracle.com wrote:
  
When you say "large data sets", how large?
Thanks
  

  

Re: Comparative study

2014-07-08 Thread Kevin Markey

  
  
Nothing particularly custom.  We've tested with small (4 node) development clusters, single-node pseudoclusters, and bigger, using plain-vanilla Hadoop 2.2 or 2.3 or CDH5 (beta and beyond), in Spark master, Spark local, Spark Yarn (client and cluster) modes, with total memory resources ranging from 4GB to 256GB+.

K


On 07/08/2014 12:04 PM, Surendranauth Hiraman wrote:


To clarify, we are not persisting to disk. That was just one of the experiments we did because of some issues we had along the way.

At this time, we are NOT using persist but cannot get the flow to complete in Standalone Cluster mode. We do not have a YARN-capable cluster at this time.

We agree with what you're saying. Your results are what we were hoping for and expecting. :-)  Unfortunately we still haven't gotten the flow to run end to end on this relatively small dataset.

It must be something related to our cluster, standalone mode or our flow, but as far as we can tell, we are not doing anything unusual.

Did you do any custom configuration? Any advice would be appreciated.

-Suren




  
  


Re: trying to understand yarn-client mode

2014-06-19 Thread Kevin Markey

  
  
Yarn-client mode is much like Spark client mode, except that the executors run in Yarn containers managed by the Yarn resource manager on the cluster instead of as Spark workers managed by the Spark master.  The driver executes as a local client in your local JVM.  It communicates with the workers on the cluster.  Transformations are scheduled on the cluster by the driver's logic.  Actions involve communication between the local driver and remote cluster executors.  So there is some additional network overhead, especially if the driver is not co-located on the cluster.  In yarn-cluster mode, in contrast, the driver is executed as a thread in a Yarn application master on the cluster.

In either case, the assembly JAR must be available to the application on the cluster.  It's best to copy it to HDFS and specify its location by exporting SPARK_JAR.
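
Since you're launching from your own application rather than spark-submit, a rough sketch of the yarn-client setup follows. Treat the exact property and environment-variable names as assumptions to check against your Spark version; the paths are illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    object YarnClientLauncher {
      def main(args: Array[String]): Unit = {
        // Expected in the environment before this JVM starts (illustrative paths):
        //   export HADOOP_CONF_DIR=/etc/hadoop/conf
        //   export SPARK_JAR=hdfs:///user/me/spark-assembly-1.0.0-hadoop2.3.0.jar
        val conf = new SparkConf()
          .setMaster("yarn-client")                             // driver stays in this JVM
          .setAppName("yarn-client-example")
          .setJars(Seq("hdfs:///user/me/my-app-assembly.jar"))  // make the app jar visible to executors
        val sc = new SparkContext(conf)

        // Transformations are scheduled on the YARN executors; the count comes back here.
        println(sc.textFile("hdfs:///user/me/input").count())
        sc.stop()
      }
    }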

Kevin Markey

On 06/19/2014 11:22 AM, Koert Kuipers wrote:


  

I am trying to understand how yarn-client mode works. I am not using spark-submit, but instead launching a spark job from within my own application.

I can see my application contacting yarn successfully, but then in yarn I get an immediate error:

    Application application_1403117970283_0014 failed 2 times due to AM Container for appattempt_1403117970283_0014_02 exited with exitCode: -1000 due to: File file:/home/koert/test-assembly-0.1-SNAPSHOT.jar does not exist .Failing this attempt.. Failing the application.

Why is yarn trying to fetch my jar, and why as a local file? I would expect the jar to be sent to yarn over the wire upon job submission?

  

  


  



Re: Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-22 Thread Kevin Markey
Tom

On Wednesday, May 21, 2014 6:10 PM, Kevin Markey kevin.mar...@oracle.com wrote:
   



  
I tested an application on RC-10 and Hadoop 2.3.0 in yarn-cluster mode that had run successfully with Spark-0.9.1 and Hadoop 2.3 or 2.2.  The application successfully ran to conclusion, but it ultimately failed.
  
  
There were 2 anomalies...

1. ASM reported only that the application was "ACCEPTED".  It never indicated that the application was "RUNNING."

    14/05/21 16:06:12 INFO yarn.Client: Application report from ASM:
        application identifier: application_1400696988985_0007
        appId: 7
        clientToAMToken: null
        appDiagnostics:
        appMasterHost: N/A
        appQueue: default
        appMasterRpcPort: -1
        appStartTime: 1400709970857
        yarnAppState: ACCEPTED
        distributedFinalState: UNDEFINED
        appTrackingUrl: http://Sleepycat:8088/proxy/application_1400696988985_0007/
        appUser: hduser
  
Furthermore, it started a second container, running two partly overlapping drivers, when it appeared that the application never started.  Each container ran to conclusion as explained above, taking twice as long as usual for both to complete.  Both instances had the same concluding failure.

2. Each instance failed as indicated by the stderr log, finding that the filesystem was closed when trying to clean up the staging directories.
  
    14/05/21 16:08:24 INFO Executor: Serialized size of result for 1453 is 863
    14/05/21 16:08:24 INFO Executor: Sending result for 1453 directly to driver
    14/05/21 16:08:24 INFO Executor: Finished task ID 1453
    14/05/21 16:08:24 INFO TaskSetManager: Finished TID 1453 in 202 ms on localhost (progress: 2/2)
    14/05/21 16:08:24 INFO DAGScheduler: Completed ResultTask(1507, 1)
    14/05/21 16:08:24 INFO TaskSchedulerImpl: Removed TaskSet 1507.0, whose tasks have all completed, from pool
    14/05/21 16:08:24 INFO DAGScheduler: Stage 1507 (count at KEval.scala:32) finished in 0.417 s
    14/05/21 16:08:24 INFO SparkContext: Job finished: count at KEval.scala:32, took 1.532789283 s
  

Failed RC-10 yarn-cluster job for FS closed error when cleaning up staging directory

2014-05-21 Thread Kevin Markey
    16:06 /user/hduser/.sparkStaging/application_1400696988985_0007/spark-assembly-1.0.0-hadoop2.3.0.jar

Just prior to the staging directory cleanup, the application concluded by writing results to 3 HDFS files.  That occurred without incident.

This particular test was run using ...

1. RC10 compiled as follows: mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
2. Ran in yarn-cluster mode using spark-submit

Is there any configuration new to 1.0.0 that I might be missing?  I walked through all the changes in the Yarn deploy web page, updating my scripts and configuration appropriately, and everything runs except for these two anomalies.

    Thanks
Kevin Markey



  



Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-03 Thread Kevin Markey

  
  
We are now testing precisely what you ask about in our environment. But Sandy's questions are relevant. The bigger issue is not Spark vs. Yarn but "client" vs. "standalone" and where the client is located on the network relative to the cluster.

The "client" options that locate the client/master remote from the cluster, while useful for interactive queries, suffer from considerable network traffic overhead as the master schedules and transfers data with the worker nodes on the cluster. The "standalone" options locate the master/client on the cluster. In yarn-standalone mode, the master is a thread contained by the Yarn application master. Lots less traffic, as the master is co-located with the worker nodes on the cluster and its scheduling/data communication has less latency.

In my comparisons between yarn-client and yarn-standalone (so as not to conflate Yarn vs. Spark), yarn-client computation time is at least double yarn-standalone! At least for a job with lots of stages and lots of client/worker communication, although rather few "collect" actions, so it's mainly scheduling that's relevant here.

I'll be posting more information as I have it available.

Kevin


On 03/03/2014 03:48 PM, Sandy Ryza wrote:


Are you running in yarn-standalone mode or yarn-client mode? Also, what YARN scheduler and what NodeManager heartbeat?
  

On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote:

Thanks for the advice Mayur.

I thought I'd report back on the performance difference... Spark standalone mode has executors processing at capacity in under a second :)



  

  


  


  



Re: Is there a way to get the current progress of the job?

2014-04-01 Thread Kevin Markey

  
  
The discussion there hits on the distinction between jobs and stages. When looking at one application, there are hundreds of stages, sometimes thousands.  It depends on the data and the task.  And the UI seems to track stages, and one could independently track them for such a job.  But what if -- as occurs in another application -- there are only one or two stages, but lots of data passing through those 1 or 2 stages?
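
For programmatic progress at the task level (which helps with the 1-or-2-stage case), one option is to register a SparkListener and count task completions per stage. This is a sketch against the Spark 1.x listener API; event and field names differ in earlier releases:

    import java.util.concurrent.ConcurrentHashMap
    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted, SparkListenerTaskEnd}

    // Counts finished tasks per stage so the application can report its own progress.
    // Register it with: sc.addSparkListener(new ProgressListener)
    class ProgressListener extends SparkListener {
      private val finishedTasks = new ConcurrentHashMap[Int, AtomicInteger]()

      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        finishedTasks.putIfAbsent(taskEnd.stageId, new AtomicInteger(0))
        val done = finishedTasks.get(taskEnd.stageId).incrementAndGet()
        println(s"stage ${taskEnd.stageId}: $done tasks finished")
      }

      override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
        val info = stageCompleted.stageInfo
        println(s"stage ${info.stageId} (${info.name}) completed: ${info.numTasks} tasks")
      }
    }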

Kevin Markey


On 04/01/2014 09:55 AM, Mark Hamstra wrote:


Some related discussion: https://github.com/apache/spark/pull/246
  

On Tue, Apr 1, 2014 at 8:43 AM, Philip Ogren philip.og...@oracle.com wrote:
  Hi DB,

Just wondering if you ever got an answer to your question
about monitoring progress - either offline or through your
own investigation. Any findings would be appreciated.

Thanks,
Philip

  

On 01/30/2014 10:32 PM, DB Tsai wrote:

Hi guys,

When we're running a very long job, we would like to show users the current progress of the map and reduce job. After looking at the API document, I don't find anything for this. However, in the Spark UI, I could see the progress of the task. Is there anything I miss?

Thanks.

Sincerely,

DB Tsai
Machine Learning Engineer
Alpine Data Labs
--
Web: http://alpinenow.com/