I got some time to look into it. It appears that Spark (latest git) is doing this operation much more often compared to the Aug 1 version. Here is the log from the operation I am referring to:
14/08/19 12:37:26 INFO spark.CacheManager: Partition rdd_8_414 not found, computing it
14/08/19 12:37:26 INFO rdd.HadoopRDD: Input split: hdfs://test/test_flows/test-2014-05-06.csv:9529458688+134217728
14/08/19 12:37:41 INFO python.PythonRDD: Times: total = 16312, boot = 8, init = 134, finish = 16170
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for [dstip]: org.apache.spark.sql.columnar.compression.PassThrough$Encoder@6374d682, ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for [dstport]: org.apache.spark.sql.columnar.compression.PassThrough$Encoder@baf23d1, ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for [srcport]: org.apache.spark.sql.columnar.compression.PassThrough$Encoder@17587455, ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for [stime]: org.apache.spark.sql.columnar.compression.PassThrough$Encoder@303d846c, ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for [endtime]: org.apache.spark.sql.columnar.compression.PassThrough$Encoder@16c0e732, ratio: 1.0
14/08/19 12:37:41 INFO columnar.StringColumnBuilder: Compressor for [srcip]: org.apache.spark.sql.columnar.compression.PassThrough$Encoder@528a8f49, ratio: 1.0
14/08/19 12:37:41 INFO storage.MemoryStore: ensureFreeSpace(64834432) called with curMem=1556288334, maxMem=9446715555

With the Aug 1 version, the log file from processing the same amount of data is approximately 136 KB, whereas with the latest git it is 23 MB. The only messages making the log file grow are the ones shown above. It appears the latest git version has an issue when reading data and converting it to the columnar format. This conversion happens when Spark creates an RDD and should occur once for each RDD, but the latest git version might simply be doing it for each record in the RDD. That is what causes the slow read from disk, as the time is spent in this operation. Any suggestion/help in this regard would be appreciated.
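As a side note, while the root cause is being investigated, the log noise itself can probably be suppressed by raising the log level for the columnar package in conf/log4j.properties. This is only a sketch; the logger names below are assumptions derived from the class names visible in the log excerpt above:

```
# Quiet the per-build "Compressor for [...]" messages
# (logger name assumed from columnar.StringColumnBuilder in the log above)
log4j.logger.org.apache.spark.sql.columnar=WARN

# Optionally also quiet the "Partition ... not found, computing it" lines
log4j.logger.org.apache.spark.CacheManager=WARN
```

This only shrinks the log file, of course; it does not address the apparent per-record conversion itself.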
- Gurvinder

On 08/14/2014 10:27 AM, Gurvinder Singh wrote:
> Hi,
>
> I am running Spark from git directly. I recently compiled the newer
> Aug 13 version, and it has a 2-3x performance drop in reads from
> HDFS compared to the git version of Aug 1. So I am wondering which
> commit could have caused such a regression in read performance. The
> performance is almost the same once data is cached in memory, but
> reads from HDFS are much slower compared to the Aug 1 version.
>
> - Gurvinder
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org