Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G, and the weight vector has around 1,300,000 dimensions. I am quite confused that the performance is nearly identical whether or not the data is cached.
The program is simple:

    points = sc.hadoopFile(input, SequenceFileInputFormat.class, ...);
    points.persist(StorageLevel.MEMORY_AND_DISK_SER()); // comment this out if not cached
    gradient = new LogisticGradient();
    updater = new SquaredL2Updater();
    initWeights = Vectors.sparse(size, new int[]{}, new double[]{});
    result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
        convergenceTol, maxNumIterations, regParam, initWeights);

I have 13 machines with 16 CPUs and 48G RAM each, and Spark is running in standalone cluster mode. These are the arguments I am using:

    --executor-memory 10G --num-executors 50 --executor-cores 2

Storage tab when caching is enabled:

    Cached Partitions: 951
    Fraction Cached:   100%
    Size in Memory:    215.7GB
    Size in Tachyon:   0.0B
    Size on Disk:      1029.7MB

Each aggregate takes around 5 minutes with caching enabled, and I can see a lot of disk IO on the Hadoop nodes. I get the same result with caching disabled.

Should caching the data points improve performance? Should caching reduce the disk IO? Thanks in advance.
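For what it's worth, the cached data does seem to fit in the cluster's storage budget, so caching itself should not be spilling much. A rough back-of-the-envelope check, assuming the Spark 1.x defaults of spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9 (both values are assumptions; please check against your actual configuration):

```java
// Sketch: estimate the cluster-wide RDD storage budget and compare it
// against the 215.7GB reported on the Storage tab. The fraction values
// below are assumed Spark 1.x defaults, not read from any config.
public class StorageBudget {
    public static void main(String[] args) {
        int numExecutors = 50;            // from --num-executors 50
        double executorMemGB = 10.0;      // from --executor-memory 10G
        double memoryFraction = 0.6;      // assumed: spark.storage.memoryFraction default
        double safetyFraction = 0.9;      // assumed: spark.storage.safetyFraction default

        double budgetGB = numExecutors * executorMemGB * memoryFraction * safetyFraction;
        double cachedGB = 215.7;          // "Size in Memory" from the Storage tab

        System.out.printf("Storage budget: %.1f GB%n", budgetGB);
        System.out.println("Cached data fits in memory: " + (cachedGB <= budgetGB));
    }
}
```

If the budget comes out above 215.7GB (as it does with these defaults), the cached RDD fits almost entirely in memory, which matches the tiny 1029.7MB "Size on Disk" figure, so the disk IO you see is more likely the initial HDFS read than cache spill.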