Yes, having many more cores than disks and all writing at the same time can definitely cause performance issues. Though that wouldn't explain the high GC. What percent of task time does the web UI report that tasks are spending in GC?
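One way to cross-check the web UI's GC column is to turn on GC logging in the executors. Below is a minimal sketch, assuming a plain Java driver like the one quoted further down in the thread; spark.executor.cores and spark.executor.extraJavaOptions are standard Spark properties, while the class name and the value "4" are only placeholders.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class GcDiagnostics {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf()
                    .setAppName("Simple Application")
                    // Cap concurrent tasks per executor so fewer tasks compete
                    // for the 7 physical disks on each node (placeholder value).
                    .set("spark.executor.cores", "4")
                    // Emit GC details to the executor logs so the "GC Time"
                    // column in the web UI can be corroborated.
                    .set("spark.executor.extraJavaOptions",
                         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps");

            JavaSparkContext sc = new JavaSparkContext(conf);
            // ... run the job as before ...
            sc.stop();
        }
    }

The same properties can also be passed on the spark-submit command line with --conf instead of being hard-coded in the driver.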
On Fri, Feb 6, 2015 at 12:56 AM, Guillermo Ortiz <konstt2...@gmail.com> wrote:
> Yes, it's surprising to me as well...
>
> I tried to execute it with different configurations:
>
> sudo -u hdfs spark-submit --master yarn-client --class
> com.mycompany.app.App --num-executors 40 --executor-memory 4g
> Example-1.0-SNAPSHOT.jar hdfs://ip:8020/tmp/sparkTest/ file22.bin
> parameters
>
> This is what I executed, with different values for num-executors and
> executor-memory.
> Do you think there are too many executors for those HDDs? Could that be
> the reason each executor takes more time?
>
> 2015-02-06 9:36 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> > It's definitely surprising to me that you would be hitting a lot of GC for
> > this scenario. Are you setting --executor-cores and --executor-memory?
> > What are you setting them to?
> >
> > -Sandy
> >
> > On Thu, Feb 5, 2015 at 10:17 AM, Guillermo Ortiz <konstt2...@gmail.com>
> > wrote:
> >>
> >> Any idea why, if I use more containers, I get a lot of stops because of GC?
> >>
> >> 2015-02-05 8:59 GMT+01:00 Guillermo Ortiz <konstt2...@gmail.com>:
> >> > I'm not caching the data. By "each iteration" I mean each 128 MB
> >> > block that an executor has to process.
> >> >
> >> > The code is pretty simple.
> >> >
> >> > final Conversor c = new Conversor(null, null, null, longFields, typeFields);
> >> > SparkConf conf = new SparkConf().setAppName("Simple Application");
> >> > JavaSparkContext sc = new JavaSparkContext(conf);
> >> > JavaRDD<byte[]> rdd = sc.binaryRecords(path, c.calculaLongBlock());
> >> >
> >> > JavaRDD<String> rddString = rdd.map(new Function<byte[], String>() {
> >> >     @Override
> >> >     public String call(byte[] arg0) throws Exception {
> >> >         String result = c.parse(arg0).toString();
> >> >         return result;
> >> >     }
> >> > });
> >> > rddString.saveAsTextFile(url + "/output/" + System.currentTimeMillis() + "/");
> >> >
> >> > The parse function just takes an array of bytes and applies some
> >> > transformations like: [0..3] an integer, [4..20] a String, [21..27]
> >> > another String, and so on.
> >> >
> >> > It's just test code; I'd like to understand what is happening.
> >> >
> >> > 2015-02-04 18:57 GMT+01:00 Sandy Ryza <sandy.r...@cloudera.com>:
> >> >> Hi Guillermo,
> >> >>
> >> >> What exactly do you mean by "each iteration"? Are you caching data in
> >> >> memory?
> >> >>
> >> >> -Sandy
> >> >>
> >> >> On Wed, Feb 4, 2015 at 5:02 AM, Guillermo Ortiz <konstt2...@gmail.com>
> >> >> wrote:
> >> >>>
> >> >>> I execute a job in Spark where I'm processing an 80 GB file in HDFS.
> >> >>> I have 5 slaves:
> >> >>> (32 cores / 256 GB / 7 physical disks) x 5
> >> >>>
> >> >>> I have been trying many different configurations with YARN:
> >> >>> yarn.nodemanager.resource.memory-mb 196 GB
> >> >>> yarn.nodemanager.resource.cpu-vcores 24
> >> >>>
> >> >>> I have tried to execute the job with different numbers of executors
> >> >>> and memory (1-4 GB).
> >> >>> With 20 executors, each iteration (128 MB) takes 25s and it never
> >> >>> spends a really long time waiting on GC.
> >> >>>
> >> >>> When I execute around 60 executors, the processing time is about 45s
> >> >>> and some tasks take up to one minute because of GC.
> >> >>>
> >> >>> I have no idea why GC kicks in when I execute more executors
> >> >>> simultaneously.
> >> >>> The other question is why it takes more time to process each block.
> >> >>> My theory is that it's because there are only 7 physical disks, and
> >> >>> 5 processes writing is not the same as 20.
> >> >>>
> >> >>> The code is pretty simple; it's just a map function which parses a
> >> >>> line and writes the output to HDFS. There are a lot of substrings
> >> >>> inside the function, which could cause GC.
> >> >>>
> >> >>> Any theories about it?
> >> >>>
> >> >>> ---------------------------------------------------------------------
> >> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >> >>>
> >> >>
> >
> >
>
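Conversor.parse isn't shown anywhere in the thread, so the following is only a hypothetical sketch of the fixed-offset parsing described above ([0..3] an integer, [4..20] a String, [21..27] another String). It illustrates why that pattern is allocation-heavy: every record creates a new String per field plus a concatenated result, and with 32 cores per node running tasks concurrently those short-lived objects fill the young generation quickly, which would match the GC pauses reported above.

    import java.nio.charset.StandardCharsets;

    // Hypothetical stand-in for Conversor.parse(byte[]); the real field layout
    // and types are whatever the actual Conversor defines.
    public class RecordParser {

        public String parse(byte[] record) {
            // [0..3]  -> 32-bit integer (assumed big-endian)
            int id = ((record[0] & 0xFF) << 24)
                   | ((record[1] & 0xFF) << 16)
                   | ((record[2] & 0xFF) << 8)
                   |  (record[3] & 0xFF);

            // [4..20] -> first String field: one new String per record
            String name = new String(record, 4, 17, StandardCharsets.UTF_8).trim();

            // [21..27] -> second String field: another allocation per record
            String code = new String(record, 21, 7, StandardCharsets.UTF_8).trim();

            // Concatenation allocates yet another StringBuilder and String.
            // These short-lived objects are the kind of garbage that shows up
            // as GC pressure when many tasks run at once on the same node.
            return id + "\t" + name + "\t" + code;
        }
    }

If allocation really is the bottleneck, a common mitigation is to do the work in mapPartitions and reuse a single StringBuilder per partition, but whether that helps here depends on the real Conversor implementation.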