Re: JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
… besides a Writable. > On Mon, Jul 25, 2016, 18:50 Jia Zou <jacqueline...@gmail.com> wrote: >> My code is as follows: >> System.out.println("Initialize points..."); …

JavaRDD.foreach (new VoidFunction<>...) always returns the last element

2016-07-25 Thread Jia Zou
My code is as follows: System.out.println("Initialize points..."); JavaPairRDD<IntWritable, DoubleArrayWritable> data = sc.sequenceFile(inputFile, IntWritable.class, DoubleArrayWritable.class);
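The reply hints at the usual cause: Hadoop's SequenceFile reader reuses a single Writable instance per field, so every element handed to foreach is the same object and only the last value is observed. Below is a minimal sketch of the standard workaround, deep-copying each record before it is used; DoubleArrayWritable is the poster's own class, and cloning with WritableUtils is an illustrative assumption rather than the fix actually adopted in the thread.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableUtils;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// sc and inputFile are the poster's existing JavaSparkContext and input path.
JavaPairRDD<IntWritable, DoubleArrayWritable> raw =
    sc.sequenceFile(inputFile, IntWritable.class, DoubleArrayWritable.class);

// Deep-copy every record as it comes out of the reader, so each RDD element is a
// distinct object rather than a reference to the single reused Writable instance.
JavaPairRDD<IntWritable, DoubleArrayWritable> data = raw.mapToPair(record -> {
    Configuration conf = new Configuration();  // per record for simplicity; mapPartitions would amortize this
    return new Tuple2<>(WritableUtils.clone(record._1(), conf),
                        WritableUtils.clone(record._2(), conf));
});

// Each element printed here is now a different object carrying its own value.
data.foreach(t -> System.out.println(t._1().get()));
```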

Re: how to calculate -- executor-memory,num-executors,total-executor-cores

2016-02-02 Thread Jia Zou
Divya, according to my recent Spark tuning experience, the optimal executor-memory size depends not only on your workload characteristics (e.g. the working set size at each job stage) and input data size, but also on your total available memory and the memory requirements of other components like
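For reference, the three knobs in the subject line map to ordinary Spark configuration properties; a minimal sketch of setting them programmatically is below. The 15g/8/4 values are placeholders for illustration, not recommendations from this thread, and spark.cores.max applies to standalone/Mesos deployments while YARN uses spark.executor.instances instead.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
    .setAppName("resource-sizing-example")
    // Heap for each executor JVM (--executor-memory); leave headroom for the OS
    // and any co-located services on the same machine.
    .set("spark.executor.memory", "15g")
    // Total cores the application may use across the cluster (--total-executor-cores).
    .set("spark.cores.max", "8")
    // Cores per executor; cores.max / executor.cores bounds the executor count.
    .set("spark.executor.cores", "4");

JavaSparkContext sc = new JavaSparkContext(conf);
```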

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-02-01 Thread Jia Zou
Hi Calvin, I am running Spark KMeans on 24GB of data in a c3.2xlarge AWS instance with 30GB of physical memory. Spark caches data off-heap in Tachyon, and the input data is also stored in Tachyon. Tachyon is configured to use 15GB of memory with tiered storage. The Tachyon underFS is /tmp. The only

[Problem Solved]Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
Hi, dears, the problem has been solved. I mistakenly used tachyon.user.block.size.bytes instead of tachyon.user.block.size.bytes.default. It works now. Sorry for the confusion, and thanks again to Gene! Best Regards, Jia On Wed, Jan 27, 2016 at 4:59 AM, Jia Zou <jacqueline...@gmail.com>

TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
Dears, I keep getting the exception below when using Spark 1.6.0 on top of Tachyon 0.8.2. Tachyon is 93% used and configured as CACHE_THROUGH. Any suggestions will be appreciated, thanks! = Exception in thread "main" org.apache.spark.SparkException: Job aborted

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
) at java.io.BufferedInputStream.read(BufferedInputStream.java:334) at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127) ... 15 more On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote: > Dears, I keep getting below exception when using Sp

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) On Wed, Jan 27, 2016 at 5:53 AM, Jia Zou <jacqueline...@gmail.com> wrote: > BTW. The tachyon worker log says

Re: Spark partition size tuning

2016-01-27 Thread Jia Zou
extraJavaOptions per job, or adding it to tachyon-site.properties. > I hope that helps, > Gene > On Mon, Jan 25, 2016 at 8:13 PM, Jia Zou <jacqueline...@gmail.com> wrote: >> Dear all, >> First to update that the local file system data p
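Combining Gene's suggestion here with the property name from the [Problem Solved] follow-up above, the per-job variant presumably looks roughly like this sketch; the 128MB value is a placeholder, and setting the driver-side option as well is an assumption, not something stated in the thread.

```java
import org.apache.spark.SparkConf;

// Pass the Tachyon client block size to this one job's JVMs via extra Java options
// (the alternative Gene mentions is putting the same property in tachyon-site.properties).
SparkConf conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-Dtachyon.user.block.size.bytes.default=134217728")   // 128MB
    .set("spark.driver.extraJavaOptions",
         "-Dtachyon.user.block.size.bytes.default=134217728");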

Re: TTransportException when using Spark 1.6.0 on top of Tachyon 0.8.2

2016-01-27 Thread Jia Zou
-10-73-198-35:7077 /home/ubuntu/HiBench/src/sparkbench/target/sparkbench-5.0-SNAPSHOT-MR2-spark1.5-jar-with-dependencies.jar tachyon://localhost:19998/Kmeans/Input/samples 10 5 On Wed, Jan 27, 2016 at 5:02 AM, Jia Zou <jacqueline...@gmail.com> wrote: > Dears, I keep getting below excep

Fwd: Spark partition size tuning

2016-01-25 Thread Jia Zou
…method can't work for Tachyon data. Do you have any suggestions? Thanks very much! Best Regards, Jia ---------- Forwarded message ---------- From: Jia Zou <jacqueline...@gmail.com> Date: Thu, Jan 21, 2016 at 10:05 PM Subject: Spark partition size tuning To: "user @spark" <user

Can Spark read input data from HDFS centralized cache?

2016-01-25 Thread Jia Zou
I configured HDFS to cache files in HDFS's centralized cache as follows: hdfs cacheadmin -addPool hibench; hdfs cacheadmin -addDirective -path /HiBench/Kmeans/Input -pool hibench. But I didn't see much performance impact, no matter how I configure dfs.datanode.max.locked.memory. Is it possible that

Spark partition size tuning

2016-01-21 Thread Jia Zou
Dear all! When using Spark to read from the local file system, the default partition size is 32MB. How can I increase the partition size to 128MB to reduce the number of tasks? Thank you very much! Best Regards, Jia
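Partition boundaries for file input come from the Hadoop input-split size, so one commonly suggested way to get 128MB partitions is to raise the minimum split size on the SparkContext's Hadoop configuration. A minimal sketch follows; the property names are standard Hadoop ones and the input path is a placeholder, and the snippet does not claim to be the fix eventually adopted in this thread.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// sc is an existing JavaSparkContext.
// Ask Hadoop's FileInputFormat for splits of at least 128MB (134217728 bytes);
// both the old and the new property names are set for compatibility.
sc.hadoopConfiguration().setLong("mapreduce.input.fileinputformat.split.minsize", 134217728L);
sc.hadoopConfiguration().setLong("mapred.min.split.size", 134217728L);

// Files read after this point are split into ~128MB partitions instead of 32MB ones.
JavaRDD<String> lines = sc.textFile("file:///home/ubuntu/kmeans/input");
```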

Can I configure Spark on multiple nodes using local filesystem on each node?

2016-01-19 Thread Jia Zou
Dear all, Can I configure Spark on multiple nodes without HDFS, so that output data will be written to the local file system on each node? I guess there is no such feature in Spark, but I just want to confirm. Best Regards, Jia

Re: Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
…that is a Hadoop MapReduce concept, not Spark. > On Sun, Jan 17, 2016 at 7:29 AM, Jia Zou <jacqueline...@gmail.com> wrote: >> Dear all, >> Is there a way to reuse the executor JVM across different JobContexts? Thanks. >> Best Regards, >> Jia

Reuse Executor JVM across different JobContext

2016-01-17 Thread Jia Zou
Dear all, Is there a way to reuse executor JVM across different JobContexts? Thanks. Best Regards, Jia
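Spark itself has no JobContext, but the underlying goal, keeping the same executor JVMs alive across many jobs, is normally achieved by submitting all of those jobs through one long-lived SparkContext. The sketch below illustrates that general behavior and is not a solution quoted from this thread.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// One SparkContext owns one set of executor JVMs; every job submitted through it
// reuses those executors (and anything cached in them) until the context stops.
SparkConf conf = new SparkConf().setAppName("shared-executors");
try (JavaSparkContext sc = new JavaSparkContext(conf)) {
    long firstJob = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();   // job 1
    long secondJob = sc.parallelize(Arrays.asList(5, 6, 7, 8)).count();  // job 2
    System.out.println(firstJob + secondJob);  // both jobs ran on the same executors
}
```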

org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Jia Zou
Dear all, I am using Spark 1.5.2 and Tachyon 0.7.1 to run KMeans with inputRDD.persist(StorageLevel.OFF_HEAP()). I've set up tiered storage for Tachyon. Everything is fine when the working set is smaller than available memory. However, when the working set exceeds available memory, I keep getting errors like
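For context, the setup being described looks roughly like the sketch below. The external-block-store property names are the Spark 1.5-era ones, and the Tachyon URL and input path are placeholders borrowed from other messages in this archive, so treat the whole block as an assumption rather than the poster's exact configuration.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.storage.StorageLevel;

SparkConf conf = new SparkConf()
    .setAppName("kmeans-offheap")
    // Point Spark's external block store at the Tachyon master (Spark 1.5/1.6 property names).
    .set("spark.externalBlockStore.url", "tachyon://localhost:19998")
    .set("spark.externalBlockStore.baseDir", "/spark");

JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> inputRDD = sc.textFile("tachyon://localhost:19998/Kmeans/Input/samples");

// In Spark 1.5.x, OFF_HEAP keeps cached blocks in the external block store (Tachyon);
// if Tachyon evicts a block, Spark can no longer find it, which is one way
// BlockNotFoundException can surface once the working set outgrows Tachyon's memory.
inputRDD.persist(StorageLevel.OFF_HEAP());
```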

Re: Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-31 Thread Jia Zou
…store the partitions that don't fit on disk and read them from there when they are needed. > Actually, it's not necessary to set such a large driver memory in your case, because KMeans uses little driver memory if your k is not very large. > Cheers > Yanbo > 2015-12-30 22:20

Spark MLLib KMeans Performance on Amazon EC2 M3.2xlarge

2015-12-30 Thread Jia Zou
I am running Spark MLLib KMeans in one EC2 M3.2xlarge instance with 8 CPU cores and 30GB of memory. Executor memory is set to 15GB, and driver memory is set to 15GB. The observation is that when the input data size is smaller than 15GB, the performance is quite stable. However, when the input data becomes
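The reply above is describing the MEMORY_AND_DISK storage level; a minimal sketch of applying it to the KMeans input is below. The input path, parsing logic, k, and iteration count are placeholders for illustration, not values from this thread.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.storage.StorageLevel;

// sc is an existing JavaSparkContext. Partitions that do not fit in executor
// memory spill to local disk instead of being dropped and recomputed each iteration.
JavaRDD<Vector> points = sc.textFile("file:///home/ubuntu/kmeans/input")
    .map(line -> {
        String[] parts = line.trim().split(" ");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i]);
        }
        return Vectors.dense(values);
    })
    .persist(StorageLevel.MEMORY_AND_DISK());

KMeansModel model = KMeans.train(points.rdd(), 10, 20);  // k=10, 20 iterations: placeholders
```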

How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
My goal is to use hprof to profile where the bottleneck is. Is there any way to do this without modifying and rebuilding the Spark source code? I've tried adding "-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,file=/home/ubuntu/out.hprof" to the spark-class script, but it can only profile

Re: How to use HProf to profile Spark CPU overhead

2015-12-12 Thread Jia Zou
Hi Ted, it works, thanks a lot for your help! --Jia On Sat, Dec 12, 2015 at 3:01 PM, Ted Yu <yuzhih...@gmail.com> wrote: > Have you tried adding the option below through spark.executor.extraJavaOptions? > Cheers > On Dec 13, 2015, at 3:36 AM, Jia Zou <
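Putting Ted's suggestion together with the agent string from the original post, the working setup presumably looks something like the sketch below; passing the option through SparkConf rather than spark-submit --conf is an arbitrary choice here, and each executor writes its own out.hprof on its local filesystem.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Attach the HPROF agent to every executor JVM instead of editing the spark-class script.
SparkConf conf = new SparkConf()
    .setAppName("hprof-profiling")
    .set("spark.executor.extraJavaOptions",
         "-Xrunhprof:cpu=samples,depth=100,interval=20,lineno=y,thread=y,"
             + "file=/home/ubuntu/out.hprof");

JavaSparkContext sc = new JavaSparkContext(conf);
// Run the workload to be profiled here; the profile is written when the executor JVM exits.
```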