Hi Tathagata,

It's a standalone cluster. The submit commands are:

== CLIENT
spark-submit --class com.fake.Test \
  --deploy-mode client --master spark://fake.com:7077 \
  fake.jar <arguments>

== CLUSTER
spark-submit --class com.fake.Test \
  --deploy-mode cluster --master spark://fake.com:7077 \
  s3n://fake.jar <arguments>

And they are both occupying all available slots (8 * 2 machines = 16 slots).

On Thu, Jan 1, 2015 at 12:21 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> What are your spark-submit commands in both cases? Is it Spark Standalone or
> YARN (both support client and cluster)? Accordingly, what is the number of
> executors/cores requested?
>
> TD
>
> On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji <eshi...@gmail.com> wrote:
>
>> Also, the job was deployed from the master machine in the cluster.
>>
>> On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji <eshi...@gmail.com> wrote:
>>
>>> Oh sorry, that was an edit mistake. The code is essentially:
>>>
>>> val msgStream = kafkaStream
>>>   .map { case (k, v) => v }
>>>   .map(DatatypeConverter.printBase64Binary)
>>>   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>>>
>>> I.e. there is essentially no original code (I was calling saveAsTextFile
>>> in a "save" function, but that was just a remnant from previous debugging).
>>>
>>> On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <so...@cloudera.com> wrote:
>>>
>>>> -dev, +user
>>>>
>>>> A decent guess: does your 'save' function entail collecting data back
>>>> to the driver? And are you running this from a machine that's not in
>>>> your Spark cluster? Then in client mode you're shipping data back to a
>>>> less-nearby machine, compared with cluster mode. That could explain
>>>> the bottleneck.
>>>>
>>>> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshi...@gmail.com> wrote:
>>>> > Hi,
>>>> >
>>>> > I have a very, very simple streaming job.
>>>> > When I deploy this on the exact same cluster, with the exact same
>>>> > parameters, I see a big (40%) performance difference between "client"
>>>> > and "cluster" deployment mode. This seems a bit surprising. Is this
>>>> > expected?
>>>> >
>>>> > The streaming job is:
>>>> >
>>>> > val msgStream = kafkaStream
>>>> >   .map { case (k, v) => v }
>>>> >   .map(DatatypeConverter.printBase64Binary)
>>>> >   .foreachRDD(save)
>>>> >   .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>>>> >
>>>> > I tried several times, but the job deployed in "client" mode can only
>>>> > write at 60% of the throughput of the job deployed in "cluster" mode,
>>>> > and this happens consistently. I'm logging at INFO level, but my
>>>> > application code doesn't log anything, so it's only Spark logs. The
>>>> > logs I see in "client" mode don't seem like a crazy amount.
>>>> >
>>>> > The setup is:
>>>> > spark-ec2 [...] \
>>>> >   --copy-aws-credentials \
>>>> >   --instance-type=m3.2xlarge \
>>>> >   -s 2 launch test_cluster
>>>> >
>>>> > And all the deployment was done from the master machine.
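
For reference, a compiling shape of the pipeline being discussed might look like the sketch below. This is only my reading of the thread, not the poster's actual code: DStream has no `saveAsTextFile` overload that takes a codec (that is an RDD method), so the write has to happen per batch inside `foreachRDD`. The per-batch path suffix and the `com.hadoop.compression.lzo.LzoCodec` import (from hadoop-lzo) are assumptions on my part.

```scala
import javax.xml.bind.DatatypeConverter
import com.hadoop.compression.lzo.LzoCodec // assumption: hadoop-lzo is on the classpath

// Assumes kafkaStream is a DStream[(Array[Byte], Array[Byte])] from KafkaUtils.
kafkaStream
  .map { case (_, v) => v }                     // keep only the message value
  .map(DatatypeConverter.printBase64Binary)     // byte[] -> base64 string
  .foreachRDD { (rdd, time) =>
    // RDD.saveAsTextFile accepts a compression codec; the batch time is
    // appended (my assumption) so successive batches get distinct paths.
    rdd.saveAsTextFile(
      s"s3n://some.bucket/path/${time.milliseconds}",
      classOf[LzoCodec])
  }
```

With this shape nothing is collected to the driver; each executor writes its partitions to S3 directly, which is consistent with Sean's point that a driver-side `collect` would be the thing to rule out.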