Hi Gary, I gave this a shot on a test cluster running CDH 4.7 and actually saw a performance regression when I ran the numbers. Have you done any benchmarking? Below are my numbers:
Experimental method:
1. Write 14GB of data to HDFS via [1]
2. Read the data multiple times via [2]

*Experiment 1: run on virtual machines*

With short-circuit read *disabled*:

14/09/24 15:10:49 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:13, took 344.931469949 s
14/09/24 15:11:30 INFO spark.SparkContext: Job finished: count at <console>:13, took 18.601568871 s
14/09/24 15:11:54 INFO spark.SparkContext: Job finished: count at <console>:13, took 16.531909024 s
14/09/24 15:12:18 INFO spark.SparkContext: Job finished: count at <console>:13, took 17.639692651 s
14/09/24 15:12:38 INFO spark.SparkContext: Job finished: count at <console>:13, took 16.773438345 s

With short-circuit read *enabled*:

14/09/24 14:28:38 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:13, took 299.511103592 s
14/09/24 14:29:17 INFO spark.SparkContext: Job finished: count at <console>:13, took 22.459146194 s
14/09/24 14:29:44 INFO spark.SparkContext: Job finished: count at <console>:13, took 19.806642815 s
14/09/24 14:30:11 INFO spark.SparkContext: Job finished: count at <console>:13, took 20.284644308 s
14/09/24 14:30:40 INFO spark.SparkContext: Job finished: count at <console>:13, took 21.720455219 s

My summary here is that enabling short-circuit read made the write go faster (what?) and caused a slight decrease in read performance, from ~17 s to ~21 s on average. The VMs were backed by FusionIO drives, but I thought there might be something funky about the VMs, so I switched to bare hardware for a second experiment.
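For what it's worth, averaging the four count runs makes the comparison easier to see. This is just plain Scala re-stating the numbers already quoted above, nothing new:

```scala
// Quick sanity check: average the four read (count) times from
// experiment 1, so the comparison isn't based on eyeballing log lines.
object ScrAverages {
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  // Times in seconds, copied from the spark-shell log output above.
  val vmDisabled = Seq(18.601568871, 16.531909024, 17.639692651, 16.773438345)
  val vmEnabled  = Seq(22.459146194, 19.806642815, 20.284644308, 21.720455219)

  def main(args: Array[String]): Unit = {
    println(f"SCR disabled: ${mean(vmDisabled)}%.1f s")
    println(f"SCR enabled:  ${mean(vmEnabled)}%.1f s")
  }
}
```

That comes out to roughly 17.4 s with short-circuit read disabled versus 21.1 s with it enabled.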
*Experiment 2: run on bare hardware*

With short-circuit read *disabled*:

14/09/24 15:59:11 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:13, took 1605.965203162 s
14/09/24 15:59:39 INFO spark.SparkContext: Job finished: count at <console>:13, took 11.984355461 s
14/09/24 16:00:00 INFO spark.SparkContext: Job finished: count at <console>:13, took 11.134712764 s
14/09/24 16:00:11 INFO spark.SparkContext: Job finished: count at <console>:13, took 8.694292372 s
14/09/24 16:00:24 INFO spark.SparkContext: Job finished: count at <console>:13, took 9.83986823 s

With short-circuit read *enabled*:

14/09/24 16:23:14 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:13, took 1113.897715871 s
14/09/24 16:25:19 INFO spark.SparkContext: Job finished: count at <console>:13, took 14.249690605 s
14/09/24 16:25:47 INFO spark.SparkContext: Job finished: count at <console>:13, took 12.67330165 s
14/09/24 16:26:04 INFO spark.SparkContext: Job finished: count at <console>:13, took 10.673825924 s
14/09/24 16:26:19 INFO spark.SparkContext: Job finished: count at <console>:13, took 9.722516379 s

This is separate hardware, so the numbers are very different (it's not just the VM overhead being bypassed). Again, the writes are much faster (1605 s -> 1113 s), but the reads are comparable if not slightly slower (~10.4 s -> ~11.8 s on average).

To make sure that short-circuit reads were actually working, I looked at the datanode logs and saw the line below. I think this confirms that (a) the read from Spark was local (127.0.0.1 -> 127.0.0.1) and (b) short-circuit read was successfully used ("success: true"):

hadoop-datanode-mybox.local.log:2014-09-24 16:26:52,800 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_FDS, blockid: -312380305519226759, srvID: DS-96112752-10.201.12.105-50010-1411586696381, success: true

Has anyone actually deployed this feature and benchmarked the gains?
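For anyone else trying to reproduce this, the settings that enable the newer (post-HDFS-347) short-circuit implementation are roughly the following, per the Apache short-circuit local reads documentation. This is a minimal sketch; the socket path is just an example and must be a location the DataNode user can create:

```xml
<!-- hdfs-site.xml, needed on both the DataNodes and the client side. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <!-- Example path; the _PORT token is substituted by the DataNode. -->
  <value>/var/run/hadoop-hdfs/dn._PORT</value>
</property>
```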
I was hoping to throw this switch on my clusters and get a 30% perf boost, but in practice that has not materialized.

Cheers!
Andrew

[1] sc.parallelize(1 to (14*1024*1024)).map(k => Seq(k, org.apache.commons.lang.RandomStringUtils.random(1024, "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")).mkString("|")).saveAsTextFile("hdfs:///tmp/output")
[2] sc.textFile("hdfs:///tmp/output").count

On Wed, Sep 17, 2014 at 11:19 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
> I'm pretty sure it does help, though I don't have any numbers for it. In
> any case, Spark will automatically benefit from this if you link it to a
> version of HDFS that contains this.
>
> Matei
>
> On September 17, 2014 at 5:15:47 AM, Gary Malouf (malouf.g...@gmail.com)
> wrote:
>
> Cloudera had a blog post about this in August 2013:
> http://blog.cloudera.com/blog/2013/08/how-improved-short-circuit-local-reads-bring-better-performance-and-security-to-hadoop/
>
> Has anyone been using this in production - curious as to if it made a
> significant difference from a Spark perspective.
>