Hi Spark, I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G, and the weight vector has around 1,300,000 dimensions. I am quite confused that the performance is nearly identical whether or not the data is cached.
The program is simple:

    points = sc.hadoopFile(input, SequenceFileInputFormat.class, ...);
    points.persist(StorageLevel.MEMORY_AND_DISK_SER()); // comment this out if not cached
    gradient = new LogisticGradient();
    updater = new SquaredL2Updater();
    initWeights = Vectors.sparse(size, new int[]{}, new double[]{});
    result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections,
        convergenceTol, maxNumIterations, regParam, initWeights);

I have 13 machines with 16 CPUs and 48G RAM each, and Spark is running in standalone cluster mode. These are the arguments I am using:

    --executor-memory 10G --num-executors 50 --executor-cores 2

Storage tab when caching is enabled:

    Cached Partitions: 951
    Fraction Cached:   100%
    Size in Memory:    215.7GB
    Size in Tachyon:   0.0B
    Size on Disk:      1029.7MB

Each aggregate takes around 5 minutes with caching enabled, and I can see a lot of disk IO on the Hadoop nodes. I get the same result with caching disabled.

Should caching the data points improve performance? Should caching reduce the disk IO? Thanks in advance.
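For what it's worth, the cached data does seem to fit in the cluster's storage budget, so caching itself should not be spilling much. A rough back-of-the-envelope check, assuming the Spark 1.x defaults of spark.storage.memoryFraction = 0.6 and spark.storage.safetyFraction = 0.9 (both values are assumptions; please check against your actual configuration):

```java
// Sketch: estimate the cluster-wide RDD storage budget and compare it
// against the 215.7GB reported on the Storage tab. The fraction values
// below are assumed Spark 1.x defaults, not read from any config.
public class StorageBudget {
    public static void main(String[] args) {
        int numExecutors = 50;            // from --num-executors 50
        double executorMemGB = 10.0;      // from --executor-memory 10G
        double memoryFraction = 0.6;      // assumed: spark.storage.memoryFraction default
        double safetyFraction = 0.9;      // assumed: spark.storage.safetyFraction default

        double budgetGB = numExecutors * executorMemGB * memoryFraction * safetyFraction;
        double cachedGB = 215.7;          // "Size in Memory" from the Storage tab

        System.out.printf("Storage budget: %.1f GB%n", budgetGB);
        System.out.println("Cached data fits in memory: " + (cachedGB <= budgetGB));
    }
}
```

If the budget comes out above 215.7GB (as it does with these defaults), the cached RDD fits almost entirely in memory, which matches the tiny 1029.7MB "Size on Disk" figure, so the disk IO you see is more likely the initial HDFS read than cache spill.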