Try to reduce the number of partitions to match the number of cores. We will also add treeAggregate to reduce the communication cost.
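For illustration, here is a minimal sketch of both ideas in Scala (assuming the implicit treeAggregate from the PR below is available; the core count is just num-executors * executor-cores from your settings, and the RDD is a toy stand-in for your points):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.rdd.RDDFunctions._  // adds treeAggregate to RDDs (per the PR below)

object TreeAggregateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tree-aggregate-sketch"))

    // Toy stand-in for the real points RDD.
    val points = sc.parallelize(1 to 1000000).map(_.toDouble)

    // One partition per core: 50 executors * 2 cores = 100, instead of ~951 partitions.
    val coalesced = points.coalesce(50 * 2)

    // treeAggregate merges partial results in a multi-level tree (depth 2 here),
    // so the driver no longer has to combine one result per partition by itself.
    val sum = coalesced.treeAggregate(0.0)(
      seqOp = (acc, x) => acc + x,
      combOp = (a, b) => a + b,
      depth = 2)

    println(s"sum = $sum")
    sc.stop()
  }
}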
PR: https://github.com/apache/spark/pull/1110

-Xiangrui

On Tue, Jul 1, 2014 at 12:55 AM, Charles Li <littlee1...@gmail.com> wrote:
> Hi Spark,
>
> I am running LBFGS on our user data. The data size with Kryo serialisation is
> about 210G. The weight size is around 1,300,000. I am quite confused that the
> performance is very close whether the data is cached or not.
>
> The program is simple:
> points = sc.hadoopFile(int, SequenceFileInputFormat.class …..)
> points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment it out if not cached
> gradient = new LogisticGradient();
> updater = new SquaredL2Updater();
> initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
> result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
>     convergeTol, maxIter, regParam, initWeight);
>
> I have 13 machines with 16 CPUs and 48G RAM each. Spark is running in
> cluster mode. Below are some arguments I am using:
> --executor-memory 10G
> --num-executors 50
> --executor-cores 2
>
> Storage used when caching:
> Cached Partitions: 951
> Fraction Cached: 100%
> Size in Memory: 215.7GB
> Size in Tachyon: 0.0B
> Size on Disk: 1029.7MB
>
> Each aggregate takes around 5 minutes with caching enabled, and lots of disk
> IO can be seen on the Hadoop nodes. I get the same result with caching
> disabled.
>
> Should caching the data points improve performance? Should caching decrease
> the disk IO?
>
> Thanks in advance.